Bug 2004416 - fails to revert snapshot of a VM [balloon/page-poison]
Summary: fails to revert snapshot of a VM [balloon/page-poison]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: qemu-kvm
Version: 8.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Assignee: Dr. David Alan Gilbert
QA Contact: Yumei Huang
Docs Contact: Jiri Herrmann
URL:
Whiteboard:
Depends On: 1984401 2027716
Blocks: 2004535
 
Reported: 2021-09-15 08:51 UTC by Dr. David Alan Gilbert
Modified: 2022-11-07 02:19 UTC
CC List: 17 users

Fixed In Version: qemu-kvm-6.2.0-1.module+el8.6.0+13725+61ae1949
Doc Type: Bug Fix
Doc Text:
.VM migration and snapshots no longer failing due to `virtio-balloon`
Previously, attempting to migrate a virtual machine (VM) with a more recent guest operating system (such as RHEL 9) failed if the VM was using the `virtio-balloon` device. Similarly, creating a snapshot of such a VM failed. This update fixes a bug in the `page poison` feature of `virtio-balloon`, which prevents the described problem from occurring.
Clone Of: 1984401
: 2004535 (view as bug list)
Environment:
Last Closed: 2022-05-10 13:21:40 UTC
Type: Bug
Target Upstream Version:
Embargoed:
yuhuang: needinfo-


Attachments


Links
Red Hat Issue Tracker RHELPLAN-97214 (last updated 2021-09-15 08:54:18 UTC)
Red Hat Product Errata RHSA-2022:1759 (last updated 2022-05-10 13:23:06 UTC)

Description Dr. David Alan Gilbert 2021-09-15 08:51:35 UTC
Migration breaks when running a recent guest (i.e. RHEL9) with a recent machine type (qemu 5.1? or newer?).
I don't expect it currently affects RHOS since they use the older -rhel7.6 machine type.
Will need to decide which 8-AV releases to Z-stream this to.

+++ This bug was initially created as a clone of Bug #1984401 +++

Description of problem:
RHEL9 host fails to revert a snapshot of the RHEL9 guest VM using `virsh snapshot-revert`.

Version-Release number of selected component (if applicable):
libvirt-7.5.0-1.el9.x86_64
libvirt-client-7.5.0-1.el9.x86_64
qemu-kvm-6.0.0-7.el9.x86_64
virt-install-3.2.0-5.el9.noarch

How reproducible:
always

Steps to Reproduce:
1. Download attached kickstart file (kickstart.cfg).
2. From the directory with the downloaded kickstart file run:
virt-install --connect=qemu:///system --name=test_vm --memory=3072 --vcpus=2 --network bridge=virbr0 --disk size=20,format=qcow2 --initrd-inject kickstart.cfg --extra-args="inst.ks=file:/kickstart.cfg inst.cmdline console=ttyS0,115200 inst.addrepo=AppStream,http://download-01.eng.brq.redhat.com/rhel-9/composes/RHEL-9/RHEL-9.0.0-20210719.1/compose/AppStream/x86_64/os/" --serial pty --graphics=none --noautoconsole --wait=-1 --location=http://download-01.eng.brq.redhat.com/rhel-9/composes/RHEL-9/RHEL-9.0.0-20210719.1/compose/BaseOS/x86_64/os/
3. virsh snapshot-create test_vm
4. virsh snapshot-list test_vm
 Name         Creation Time               State
---------------------------------------------------
 1626865894   2021-07-21 07:11:34 -0400   running
5. virsh snapshot-revert test_vm --current
error: operation failed: Failed to load snapshot: VQ 0 size 0x80 < last_avail_idx 0x0 - used_idx 0x44
Failed to load virtio-balloon:virtio
error while loading state for instance 0x0 of device '0000:00:01.4:00.0/virtio-balloon'
Error: Error -1 while loading VM state
6. virsh list
 Id   Name      State
------------------------
 2    test_vm   paused
7. virsh snapshot-dumpxml test_vm 1626865894 (attached as snapshot.xml)


Actual results:
virsh snapshot-revert fails to revert a VM to the specified snapshot.

Expected results:
virsh snapshot-revert reverts a VM to the specified snapshot.

Additional info:

--- Additional comment from Matus Marhefka on 2021-07-21 12:26:00 BST ---



--- Additional comment from Matus Marhefka on 2021-07-21 12:40:29 BST ---

Dump file created from the affected VM:
https://drive.google.com/file/d/1bU1hM1jjhW4RT5JbXIogNesJ1m5Rztmf/view?usp=sharing

--- Additional comment from Matus Marhefka on 2021-07-28 09:57:49 BST ---

libvirtd.log produced with the following settings in /etc/libvirt/libvirtd.conf:

log_filters="3:remote 4:event 3:util.json 3:util.object 3:util.dbus 3:util.netlink 3:node_device 3:rpc 3:access 1:*"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

--- Additional comment from Peter Krempa on 2021-07-28 15:05:19 BST ---

So I've analyzed the debug log and this happens inside a single QEMU run where the snapshot was taken (savevm) and restored right away (loadvm). No other operations were done in the meantime, thus this will most probably be a QEMU bug.

The VM was started with the following commandline:

/usr/libexec/qemu-kvm \
-name guest=test_vm,debug-threads=on \
-S \
-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-3-test_vm/master-key.aes"}' \
-machine pc-q35-rhel8.5.0,accel=kvm,usb=off,dump-guest-core=off,memory-backend=pc.ram \
-cpu Broadwell,vme=on,ss=on,vmx=on,f16c=on,rdrand=on,hypervisor=on,arat=on,tsc-adjust=on,umip=on,arch-capabilities=on,xsaveopt=on,pdpe1gb=on,abm=on,skip-l1dfl-vmentry=on,pschange-mc-no=on,aes=off \
-m 3072 \
-object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":3221225472}' \
-overcommit mem-lock=off \
-smp 2,sockets=2,cores=1,threads=1 \
-uuid 72e59c07-066f-4cd7-9642-c4d7b1497c2c \
-display none \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=31,server=on,wait=off \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-global ICH9-LPC.disable_s3=1 \
-global ICH9-LPC.disable_s4=1 \
-boot strict=on \
-device pcie-root-port,port=0x8,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x1 \
-device pcie-root-port,port=0x9,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x1 \
-device pcie-root-port,port=0xa,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x2 \
-device pcie-root-port,port=0xb,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x3 \
-device pcie-root-port,port=0xc,chassis=5,id=pci.5,bus=pcie.0,addr=0x1.0x4 \
-device pcie-root-port,port=0xd,chassis=6,id=pci.6,bus=pcie.0,addr=0x1.0x5 \
-device pcie-root-port,port=0xe,chassis=7,id=pci.7,bus=pcie.0,addr=0x1.0x6 \
-device qemu-xhci,p2=15,p3=15,id=usb,bus=pci.2,addr=0x0 \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.3,addr=0x0 \
-blockdev '{"driver":"file","filename":"/var/lib/libvirt/images/test_vm.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
-device virtio-blk-pci,bus=pci.4,addr=0x0,drive=libvirt-1-format,id=virtio-disk0,bootindex=1 \
-netdev tap,fd=34,id=hostnet0,vhost=on,vhostfd=35 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:68:30:d9,bus=pci.1,addr=0x0 \
-chardev pty,id=charserial0 \
-device isa-serial,chardev=charserial0,id=serial0 \
-chardev socket,id=charchannel0,fd=36,server=on,wait=off \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \
-audiodev id=audio1,driver=none \
-device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0 \
-object '{"qom-type":"rng-random","id":"objrng0","filename":"/dev/urandom"}' \
-device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on

Here's a dump of the QMP traffic starting from the initial 'cont' (truncated to 'cont' because the CPU configuration made libvirt probe the CPU state, which is ~300 QMP commands). The full log will be attached:

2021-07-28 13:39:20.964+0000: 28607: info : qemuMonitorSend:958 : QEMU_MONITOR_SEND_MSG: mon=0x7fdb2004a050 msg={"execute":"cont","id":"libvirt-385"}
2021-07-28 13:39:20.965+0000: 35960: info : qemuMonitorJSONIOProcessLine:236 : QEMU_MONITOR_RECV_EVENT: mon=0x7fdb2004a050 event={"timestamp": {"seconds": 1627479560, "microseconds": 965211}, "event": "RESUME"}
2021-07-28 13:39:20.966+0000: 35960: info : qemuMonitorJSONIOProcessLine:241 : QEMU_MONITOR_RECV_REPLY: mon=0x7fdb2004a050 reply={"return": {}, "id": "libvirt-385"}
2021-07-28 13:39:42.009+0000: 35960: info : qemuMonitorJSONIOProcessLine:236 : QEMU_MONITOR_RECV_EVENT: mon=0x7fdb2004a050 event={"timestamp": {"seconds": 1627479582, "microseconds": 8871}, "event": "VSERPORT_CHANGE", "data": {"open": true, "id": "channel0"}}
2021-07-28 13:39:43.579+0000: 35960: info : qemuMonitorJSONIOProcessLine:236 : QEMU_MONITOR_RECV_EVENT: mon=0x7fdb2004a050 event={"timestamp": {"seconds": 1627479583, "microseconds": 579207}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "net0", "path": "/machine/peripheral/net0/virtio-backend"}}
2021-07-28 13:40:06.075+0000: 28609: info : qemuMonitorSend:958 : QEMU_MONITOR_SEND_MSG: mon=0x7fdb2004a050 msg={"execute":"stop","id":"libvirt-386"}
2021-07-28 13:40:06.086+0000: 35960: info : qemuMonitorJSONIOProcessLine:236 : QEMU_MONITOR_RECV_EVENT: mon=0x7fdb2004a050 event={"timestamp": {"seconds": 1627479606, "microseconds": 85789}, "event": "STOP"}
2021-07-28 13:40:06.090+0000: 35960: info : qemuMonitorJSONIOProcessLine:241 : QEMU_MONITOR_RECV_REPLY: mon=0x7fdb2004a050 reply={"return": {}, "id": "libvirt-386"}
2021-07-28 13:40:06.091+0000: 28609: info : qemuMonitorSend:958 : QEMU_MONITOR_SEND_MSG: mon=0x7fdb2004a050 msg={"execute":"human-monitor-command","arguments":{"command-line":"savevm \"1627479606\""},"id":"libvirt-387"}
2021-07-28 13:40:06.092+0000: 35960: info : qemuMonitorJSONIOProcessLine:236 : QEMU_MONITOR_RECV_EVENT: mon=0x7fdb2004a050 event={"timestamp": {"seconds": 1627479606, "microseconds": 92135}, "event": "MIGRATION", "data": {"status": "setup"}}
2021-07-28 13:40:06.094+0000: 35960: info : qemuMonitorJSONIOProcessLine:236 : QEMU_MONITOR_RECV_EVENT: mon=0x7fdb2004a050 event={"timestamp": {"seconds": 1627479606, "microseconds": 94650}, "event": "MIGRATION_PASS", "data": {"pass": 1}}
2021-07-28 13:40:08.860+0000: 35960: info : qemuMonitorJSONIOProcessLine:236 : QEMU_MONITOR_RECV_EVENT: mon=0x7fdb2004a050 event={"timestamp": {"seconds": 1627479608, "microseconds": 860364}, "event": "MIGRATION_PASS", "data": {"pass": 2}}
2021-07-28 13:40:08.866+0000: 35960: info : qemuMonitorJSONIOProcessLine:236 : QEMU_MONITOR_RECV_EVENT: mon=0x7fdb2004a050 event={"timestamp": {"seconds": 1627479608, "microseconds": 866070}, "event": "MIGRATION", "data": {"status": "completed"}}
2021-07-28 13:40:10.402+0000: 35960: info : qemuMonitorJSONIOProcessLine:241 : QEMU_MONITOR_RECV_REPLY: mon=0x7fdb2004a050 reply={"return": "", "id": "libvirt-387"}
2021-07-28 13:40:10.402+0000: 28609: info : qemuMonitorSend:958 : QEMU_MONITOR_SEND_MSG: mon=0x7fdb2004a050 msg={"execute":"cont","id":"libvirt-388"}
2021-07-28 13:40:10.403+0000: 35960: info : qemuMonitorJSONIOProcessLine:236 : QEMU_MONITOR_RECV_EVENT: mon=0x7fdb2004a050 event={"timestamp": {"seconds": 1627479610, "microseconds": 403812}, "event": "RESUME"}
2021-07-28 13:40:10.420+0000: 35960: info : qemuMonitorJSONIOProcessLine:241 : QEMU_MONITOR_RECV_REPLY: mon=0x7fdb2004a050 reply={"return": {}, "id": "libvirt-388"}
2021-07-28 13:40:20.274+0000: 28608: info : qemuMonitorSend:958 : QEMU_MONITOR_SEND_MSG: mon=0x7fdb2004a050 msg={"execute":"stop","id":"libvirt-389"}
2021-07-28 13:40:20.283+0000: 35960: info : qemuMonitorJSONIOProcessLine:236 : QEMU_MONITOR_RECV_EVENT: mon=0x7fdb2004a050 event={"timestamp": {"seconds": 1627479620, "microseconds": 283371}, "event": "STOP"}
2021-07-28 13:40:20.284+0000: 35960: info : qemuMonitorJSONIOProcessLine:241 : QEMU_MONITOR_RECV_REPLY: mon=0x7fdb2004a050 reply={"return": {}, "id": "libvirt-389"}
2021-07-28 13:40:20.284+0000: 28608: info : qemuMonitorSend:958 : QEMU_MONITOR_SEND_MSG: mon=0x7fdb2004a050 msg={"execute":"human-monitor-command","arguments":{"command-line":"loadvm \"1627479606\""},"id":"libvirt-390"}
2021-07-28 13:40:20.448+0000: 35960: info : qemuMonitorJSONIOProcessLine:236 : QEMU_MONITOR_RECV_EVENT: mon=0x7fdb2004a050 event={"timestamp": {"seconds": 1627479620, "microseconds": 448204}, "event": "VSERPORT_CHANGE", "data": {"open": false, "id": "channel0"}}
2021-07-28 13:40:25.721+0000: 35960: info : qemuMonitorJSONIOProcessLine:241 : QEMU_MONITOR_RECV_REPLY: mon=0x7fdb2004a050 reply={"return": "VQ 0 size 0x80 < last_avail_idx 0x0 - used_idx 0xaaaa\r\nFailed to load virtio-balloon:virtio\r\nerror while loading state for instance 0x0 of device '0000:00:01.4:00.0/virtio-balloon'\r\nError: Error -1 while loading VM state\r\n", "id": "libvirt-390"}

Reassigning to QEMU for further analysis

--- Additional comment from Peter Krempa on 2021-07-28 15:08:06 BST ---



--- Additional comment from Klaus Heinrich Kiwi on 2021-08-02 19:54:38 BST ---

Thanks Peter.

Max, care to take this one?

--- Additional comment from Hanna Reitz on 2021-08-05 10:28:00 BST ---

Hi,

I tried to reproduce the issue but failed so far.  I tried this command line:

qemu-system-x86_64 \
    -name guest=test_vm,debug-threads=on \
    -S \
    -machine pc-q35-rhel8.5.0,accel=kvm,usb=off,dump-guest-core=off,memory-backend=pc.ram \
    -cpu Broadwell,vme=on,ss=on,vmx=on,f16c=on,rdrand=on,hypervisor=on,arat=on,tsc-adjust=on,umip=on,arch-capabilities=on,xsaveopt=on,pdpe1gb=on,abm=on,skip-l1dfl-vmentry=on,pschange-mc-no=on,aes=off \
    -m 3072 \
    -object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":3221225472}' \
    -overcommit mem-lock=off \
    -smp 2,sockets=2,cores=1,threads=1 \
    -uuid 72e59c07-066f-4cd7-9642-c4d7b1497c2c \
    -display gtk \
    -no-user-config \
    -nodefaults \
    -rtc base=utc,driftfix=slew \
    -global kvm-pit.lost_tick_policy=delay \
    -no-hpet \
    -no-shutdown \
    -global ICH9-LPC.disable_s3=1 \
    -global ICH9-LPC.disable_s4=1 \
    -device pcie-root-port,port=0x8,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x1 \
    -device pcie-root-port,port=0x9,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x1 \
    -device pcie-root-port,port=0xa,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x2 \
    -device pcie-root-port,port=0xb,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x3 \
    -device pcie-root-port,port=0xc,chassis=5,id=pci.5,bus=pcie.0,addr=0x1.0x4 \
    -device pcie-root-port,port=0xd,chassis=6,id=pci.6,bus=pcie.0,addr=0x1.0x5 \
    -device pcie-root-port,port=0xe,chassis=7,id=pci.7,bus=pcie.0,addr=0x1.0x6 \
    -device qemu-xhci,p2=15,p3=15,id=usb,bus=pci.2,addr=0x0 \
    -device virtio-serial-pci,id=virtio-serial0,bus=pci.3,addr=0x0 \
    -blockdev '{"driver":"file","filename":"test.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
    -blockdev '{"node-name":"libvirt-1-format","read-only":false,"driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
    -device virtio-blk-pci,bus=pci.4,addr=0x0,drive=libvirt-1-format,id=virtio-disk0,bootindex=1 \
    -netdev user,id=hostnet0 \
    -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:68:30:d9,bus=pci.1,addr=0x0 \
    -chardev pty,id=charserial0 \
    -device isa-serial,chardev=charserial0,id=serial0 \
    -chardev socket,id=charchannel0,path=/tmp/qemu-serial,server=on,wait=off \
    -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \
    -audiodev id=audio1,driver=none \
    -device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0 \
    -object '{"qom-type":"rng-random","id":"objrng0","filename":"/dev/urandom"}' \
    -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0 \
    -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
    -msg timestamp=on \
    -cdrom arch.iso \
    -device virtio-vga,bus=pci.7 \
    -qmp stdio

(Differences: -qmp stdio instead of through a socket; also no fd passing in general; graphical output with virtio-vga; boot from an Arch Linux ISO; no master key secret)

Then I execute the following QMP commands:

{"execute": "qmp_capabilities"}
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "events", "state": true}]}}
{"execute": "cont"}

Then I let it boot, and then:

{"execute": "stop"}
{"execute": "human-monitor-command", "arguments": {"command-line": "savevm \"1627479606\""}}
{"execute": "cont"}
{"execute": "stop"}
{"execute": "human-monitor-command", "arguments": {"command-line": "loadvm \"1627479606\""}}

And I don’t get an error.

Since the error appears in the virtio-balloon device, I don’t know whether I should be actively doing anything with the device.  I’ve tried enabling guest-stats-polling, which worked (i.e. I got valid stats), so I suppose the device is in use by the guest.


I’ve also tried loading the VM state attached in comment 2, but that gives me:

2021-08-05T09:05:06.966966Z qemu-system-x86_64: Unknown savevm section or instance 'kvmclock' 0. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices
2021-08-05T09:05:06.967007Z qemu-system-x86_64: load of migration failed: Invalid argument


On another note, I don’t think it’s likely that there’s a bug in the storage layer here.  Of course it’s entirely possible, for example if the migration stream is truncated or corrupted because of some qcow2 bug.  (And “used_idx 0xaaaa” does look weird, admittedly.)
But I guess to be able to tell for sure I’d need to be able to reproduce it... :/

--- Additional comment from Hanna Reitz on 2021-08-16 18:11:57 BST ---

Hi,

Here are my findings from a VM state I generated on a test machine (thanks, Milan, for providing it to me):

First, from a qcow2 perspective, the image looks OK, the VM state looks intact.

Second, I copied off the VM state and compared the virtio-balloon part (because that’s the device that’s complaining) to a reference state that loaded fine.  In principle, both looked similar to my untrained eye (I don’t have any experience with migration states); the only difference was that my reference state was missing the page-poison section (“vitio-balloon-device/page-poison”, yes, it’s “vitio-”, without the R – I don’t know whether that may be a problem?).

Third, I presume my reference state didn’t have a page-poison section because the guest didn’t enable that feature (surprising to me, as my guest was Arch with a current kernel).  So I forced qemu to create such a section by having virtio_balloon_page_poison_support() return true unconditionally, and having virtio_balloon_config_size() return the size assuming that poison support were there.  Then, I got a very similar error to the one reported in comment 0 (even on upstream):

> VQ 2 size 0x80 < last_avail_idx 0xe - used_idx 0xd94d
> Failed to load virtio-balloon:virtio
> error while loading state for instance 0x0 of device '0000:00:01.4:00.0/virtio-balloon'
> Error: Error -1 while loading VM state

I believe there’s something wrong with the page-poison feature in the migration stream.  I don’t think this bug is caused by qcow2 or any other storage feature – to double check, you could try to migrate to an external file (I’m not sure, I’ve never used it – but `virsh save`/`virsh restore` seem to be the commands for this) and see whether the error appears there, too.

Or you could try disabling the page-poison feature by adding `page-poison=off` to the virtio-balloon device parameters.  (If that makes the problem go away, then it probably is indeed a problem in the migration data.)
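Concretely, the two checks could look something like this; the domain and file names below are only placeholders, not taken from this report:

# virsh save test_vm /var/tmp/test_vm.state    # save the state to an external file instead of an internal snapshot
# virsh restore /var/tmp/test_vm.state

and, for the second check, the balloon device line of a plain QEMU invocation would gain the extra property (a libvirt passthrough variant is shown later in the thread):

-device virtio-balloon-pci,id=balloon0,page-poison=off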

Can you try either of those things (or perhaps even both)?

Thanks!

Hanna

--- Additional comment from Tingting Mao on 2021-08-23 05:01:07 BST ---

Tried to reproduce this bug as below. The guest paused after 'loadvm' via HMP; not sure whether this reproduces the issue exactly. The issue here should be the same as bz 1959676 in rhel8.5-av.


Tested with:
qemu-kvm-6.0.0-7.el9
kernel-5.14.0-0.rc4.35.el9.x86_64


Steps:
1. Boot up a guest with the CML below
/usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine q35 \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 15360  \
    -smp 16,maxcpus=16,cores=8,threads=1,dies=1,sockets=2  \
    -cpu 'Haswell-noTSX',+kvm_pv_unhalt \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -object iothread,id=iothread0 \
    -object iothread,id=iothread1 \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:1c:0c:0d:e3:4c,id=idjmZXQS,netdev=idEFQ4i1,bus=pcie-root-port-3,addr=0x0  \
    -netdev tap,id=idEFQ4i1,vhost=on  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -monitor stdio \
    -device pcie-root-port,id=pcie-root-port-5,port=0x6,addr=0x1.0x5,bus=pcie.0,chassis=5 \
    -device virtio-scsi-pci,id=virtio_scsi_pci2,bus=pcie-root-port-5,addr=0x0 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=$1,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
    -chardev socket,server=on,path=/var/tmp/monitor-qmpmonitor1-20210721-024113-AsZ7KYro,id=qmp_id_qmpmonitor1,wait=off  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \


2. Connect to the console socket and save/load an internal snapshot via QMP
# nc -U monitor-qmpmonitor1-20210721-024113-AsZ7KYro 
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 0, "major": 6}, "package": "qemu-kvm-6.0.0-7.el9"}, "capabilities": ["oob"]}}
{"execute": "qmp_capabilities", "id": "kDiYPtqp"}
{"return": {}, "id": "kDiYPtqp"}
{"execute": "human-monitor-command", "arguments": {"command-line": "savevm sn1"}, "id": "JcqY2Shq"}
{"timestamp": {"seconds": 1629690456, "microseconds": 582098}, "event": "STOP"}

{"timestamp": {"seconds": 1629690556, "microseconds": 643581}, "event": "RESUME"}
{"return": "", "id": "JcqY2Shq"}
{"execute": "human-monitor-command", "arguments": {"command-line": "loadvm sn1"}, "id": "y5joUWoc"}
{"timestamp": {"seconds": 1629690566, "microseconds": 801531}, "event": "STOP"} 
{"return": "", "id": "y5joUWoc"} ---------------------------------------------------------> Only stop, not resume!



Results:
After step 2, checked the status of the guest via HMP:
(qemu) info status
VM status: paused (restore-vm)

The guest paused.

--- Additional comment from Tingting Mao on 2021-08-23 06:38:10 BST ---

(In reply to Tingting Mao from comment #9)
> Tried to reproduced this bug as below. The guest paused after 'loadvm' via
> HMP, not sure whether reproduced this issue exactly. And the issue here
> should be the same as bz 1959676 in rhel8.5-av. 
> 
> 
> Tested with:
> qemu-kvm-6.0.0-7.el9
> kernel-5.14.0-0.rc4.35.el9.x86_64
But did NOT hit the issue in qemu-kvm-6.0.0-12.el9; resume succeeded after 'loadvm'.

Steps are the same as the ones in Comment 9, and results are listed below:
# nc -U monitor-qmpmonitor1-20210721-024113-AsZ7KYro 
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 0, "major": 6}, "package": "qemu-kvm-6.0.0-12.el9"}, "capabilities": ["oob"]}}
{"execute": "qmp_capabilities", "id": "kDiYPtqp"}
{"return": {}, "id": "kDiYPtqp"}
{"execute": "human-monitor-command", "arguments": {"command-line": "savevm sn1"}, "id": "JcqY2Shq"}
{"timestamp": {"seconds": 1629696679, "microseconds": 751965}, "event": "STOP"}
{"timestamp": {"seconds": 1629696742, "microseconds": 366660}, "event": "RESUME"}
{"return": "", "id": "JcqY2Shq"}
{"execute": "human-monitor-command", "arguments": {"command-line": "loadvm sn1"}, "id": "y5joUWoc"}
{"timestamp": {"seconds": 1629696762, "microseconds": 925256}, "event": "STOP"}
{"timestamp": {"seconds": 1629696768, "microseconds": 282469}, "event": "RESUME"} ----------------------------------> Resumed here.
{"return": "", "id": "y5joUWoc"}

Check the status of the guest:
(qemu) info status
VM status: running ---------------------------------> Guest is running well.

--- Additional comment from Tingting Mao on 2021-08-23 07:01:17 BST ---

Hi Matus,

According to Comment 9 and Comment 10, could you please help to retest with the latest qemu build (qemu-kvm-6.0.0-12.el9)?


Thanks,
Tingting Mao

--- Additional comment from Matus Marhefka on 2021-08-31 15:55:27 BST ---

Tested with qemu-kvm-6.0.0-12.el9 but the issue still persists.

What actually helped is the suggestion by Hanna from comment 8 to disable the page-poison feature.
This is how the issue can be worked around ATM (all credit goes to Hanna):

# virsh edit DOMAIN

First, enable qemu command line passthrough by adding
"xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'" to the <domain> tag:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

Second, add this to the end of the <domain> block (i.e. at the bottom, above the </domain>):

<qemu:commandline>
   <qemu:arg value='-set'/>
   <qemu:arg value='device.balloon0.page-poison=off'/>
</qemu:commandline>

After rebooting the VM, snapshot reverts using `virsh snapshot-revert` work as expected and there is no error anymore.
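To double-check the workaround, the snapshot steps from comment 0 can simply be re-run (the domain name is a placeholder):

# virsh snapshot-create test_vm
# virsh snapshot-revert test_vm --current
# virsh list    # the domain should end up "running", not "paused"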

--- Additional comment from Klaus Heinrich Kiwi on 2021-08-31 19:25:34 BST ---

(In reply to Matus Marhefka from comment #12)
> Tested with qemu-kvm-6.0.0-12.el9 but the issue still persists.
> 
> What actually helped is the suggestion by Hana from the comment 8 to disable
> page-poison feature.
> This is how the issue can be workarounded ATM (all credits go to Hana):
Thanks Matus.

Hanna / Dr. David - is Hanna still the best person to investigate this, or should we reassign (migration, some other area)?

--- Additional comment from Yumei Huang on 2021-09-01 11:28:26 BST ---

I had a try with specifying page-poison=on in the QEMU cli and doing savevm/loadvm repeatedly; the guest works well and the issue is not reproduced.

qemu-kvm-6.0.0-12.el9
kernel-5.14.0-0.rc6.46.el9.x86_64


QEMU cli:
# /usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine pc,nvdimm=on  \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2 \
    -m 8G  \
    -smp 16 \
    -cpu host \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 \
    -blockdev node-name=file_image1,driver=file,aio=threads,filename=/home/kvm_autotest_root/images/rhel900-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
    -device virtio-net-pci,mac=9a:cb:96:f7:25:87,id=id41v7Gq,netdev=idMRebYW,bus=pci.0,addr=0x5  \
    -netdev tap,id=idMRebYW,vhost=on \
    -vnc :1  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm -monitor stdio \
    -qmp tcp:0:4444,server,nowait \
    -serial tcp:0:1234,server,nowait \
    -device virtio-balloon-pci,id=balloon0,page-poison=on


(qemu) savevm sn01
(qemu) loadvm sn01
(qemu) info status
VM status: running

--- Additional comment from Yumei Huang on 2021-09-02 07:41:33 BST ---

Hi Matus,

I failed to reproduce the issue with either QEMU or libvirt. The QEMU steps are in comment 14, and libvirt's are below. Would you please help check if anything is missing in my steps? Thanks.


# virsh define xml1
Domain 'rhel9' defined from xml1

# virsh start rhel9
Domain 'rhel9' started

# virsh snapshot-create rhel9
Domain snapshot 1630562110 created

# virsh snapshot-list rhel9
 Name         Creation Time               State
---------------------------------------------------
 1630562110   2021-09-02 01:55:10 -0400   running

# virsh snapshot-revert rhel9 --current
 
# virsh list
 Id   Name    State
-----------------------
 1    rhel9   running


Test packages:
qemu-kvm-6.0.0-12.el9
libvirt-client-7.6.0-2.el9.x86_64
kernel-5.14.0-0.rc6.46.el9.x86_64

--- Additional comment from Hanna Reitz on 2021-09-03 16:57:39 BST ---

(In reply to Klaus Heinrich Kiwi from comment #13)
> Hanna / Dr. David - is Hanna still the best person to investigate this, or
> should we reassign (migration, some other area)?

I don’t think there’s much that I can investigate at this point, so I’d prefer reassigning this BZ to someone with migration or specifically virtio-balloon experience.


(In reply to Yumei Huang from comment #14)
> I had a try with specifying page-poison=on in QEMU cli, and do savevm/loadvm
> repeatedly, guest works well, the issue is not reproduced.

I’d like to note explicitly that page-poison=on is the default, so as then noted in comment 15, this just means that the issue isn’t reproducible with this environment.  (It does depend on the guest (because as I wrote in comment 8, the guest also needs to enable the feature, and that doesn’t seem to be a given), but then again, you seem to have used a RHEL 9 guest just as reported in comment 0...)

Hanna

--- Additional comment from Klaus Heinrich Kiwi on 2021-09-03 17:24:29 BST ---

(In reply to Hanna Reitz from comment #16)
> (In reply to Klaus Heinrich Kiwi from comment #13)
> > Hanna / Dr. David - is Hanna still the best person to investigate this, or
> > should we reassign (migration, some other area)?
> 
> I don’t think there’s much that I can investigate at this point, so I’d
> prefer reassigning this BZ to someone with migration or specifically
> virtio-balloon experience.
> 
Thanks Hanna. I'm parking it in the live-migration queue for now, for lack of a better match.

--- Additional comment from Matus Marhefka on 2021-09-06 14:21:43 BST ---

Hello Yumei,

Please try to follow steps from the comment 1 and you should be able to reproduce it. It works for me and I've already tested it on multiple RHEL9 composes.

--- Additional comment from Matus Marhefka on 2021-09-06 14:23:16 BST ---

(In reply to Matus Marhefka from comment #18)
> Hello Yumei,
> 
> Please try to follow steps from the comment 1 and you should be able to
> reproduce it. It works for me and I've already tested it on multiple RHEL9
> composes.

I meant comment 0.

--- Additional comment from Dr. David Alan Gilbert on 2021-09-06 18:45:44 BST ---

Curious; this is confusing me since my reading of the page poison mechanism is that it doesn't change the queue behaviour in qemu;
so given that the error is a queue inconsistency I'm not sure what's going on.
I *think* it's just a change in config data - but even then I can see that virtio_balloon_config_size has a check in it.

The other feature added at the same time (free page reporting) seems more likely to do bad things to the queues; but then again it has its own queue,
not the queue 0 that the error is reporting.

--- Additional comment from Dr. David Alan Gilbert on 2021-09-06 19:28:07 BST ---

(In reply to Matus Marhefka from comment #12)
> Tested with qemu-kvm-6.0.0-12.el9 but the issue still persists.
> 
> What actually helped is the suggestion by Hana from the comment 8 to disable
> page-poison feature.
> This is how the issue can be workarounded ATM (all credits go to Hana):
> 
> # virsh edit DOMAIN
> 
> and enable qemu command line passthrough by adding
> "xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'" to the <domain>
> tag:
> 
> <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
> 
> Second, add this to the end of the <domain> block (i.e. at the bottom, above
> the </domain>):
> 
> <qemu:commandline>
>    <qemu:arg value='-set'/>
>    <qemu:arg value='device.balloon0.page-poison=off'/>
> </qemu:commandline>
> 
> After rebooting the VM snapshot reverts using `virsh snapshot-revert` work
> as expected and there is no error anymore.

Hi Matus,
  Could you please try disabling free-page-reporting instead of page-poison and see if
just turning off free-page-reporting fixes it?
(If I read the guest kernel code right, then disabling poisoning also disables free page reporting)
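Concretely, that would mean swapping the property in the passthrough quoted above, e.g. (a sketch, not taken from an actual run):

-set device.balloon0.free-page-reporting=off

or, on a plain qemu-kvm command line:

-device virtio-balloon-pci,id=balloon0,free-page-reporting=off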

--- Additional comment from Yumei Huang on 2021-09-08 05:27:04 BST ---

(In reply to Matus Marhefka from comment #18)
> Hello Yumei,
> 
> Please try to follow steps from the comment 1 and you should be able to
> reproduce it. It works for me and I've already tested it on multiple RHEL9
> composes.

Thanks Matus.

I was able to reproduce the issue with the exact steps in comment 0.

--- Additional comment from Dr. David Alan Gilbert on 2021-09-08 10:15:34 BST ---

Same here; I've just reproduced this on a rhel8 host.

--- Additional comment from Dr. David Alan Gilbert on 2021-09-08 10:21:12 BST ---

I just confirmed it:
  works with page-poison=off
  still fails with free-page-reporting=off

--- Additional comment from Dr. David Alan Gilbert on 2021-09-08 20:05:55 BST ---

There's depressingly little interesting to see from the qemu side
  virtio_balloon_handle_output is never called in the good or bad version
  the queue data seems to be correctly aligned/loaded
  the source/dest both agree page_poison is enabled and get the config size the same

time to dig into the guest kernel driver.

--- Additional comment from Dr. David Alan Gilbert on 2021-09-09 15:13:46 BST ---

confirmed it still happens on upstream 5.14.0 guest

--- Additional comment from Dr. David Alan Gilbert on 2021-09-09 17:49:30 BST ---

the guest driver init code seems happy :
[    2.310088] virtballoon_validate: entry
[    2.310893] virtballoon_validate: exit
[    2.311305] virtballoon_probe: entry
[    2.311634] virtballoon_probe: (a)
[    2.311979] virtballoon_probe: (b)
[    2.312296] ballon:init_vqs entry
[    2.312877] ballon:init_vqs exit good
[    2.313211] virtballoon_probe: (c)
[    2.313543] virtballoon_probe: (d)
[    2.313864] virtballoon_probe: (e)
[    2.314170] virtballoon_probe: (f)
[    2.314475] virtballoon_probe: (g)
[    2.314786] virtballoon_probe: (h)
[    2.315186] virtballoon_probe: good exit

--- Additional comment from Dr. David Alan Gilbert on 2021-09-09 19:43:53 BST ---

It is causing a config write with the poison data; but I can't see any side effects happening from it;
the other config values don't seem to change.
(Note, I'm not seeing anything anywhere that actually uses the poison data - if we need a fast way out of this then I suggest
we disable page-poison).

All the init code is apparently happy, and we never actually do any balloon transactions as far as I can tell.
It makes me wonder if the queue pointers are completely wrong.

--- Additional comment from Dr. David Alan Gilbert on 2021-09-13 14:44:09 BST ---

The last_avail_idx being 0 actually looks fine; it looks like the used pointer (coming from RAM?) is the bogus part

Bad case:

virtio_load: status=f queue_sel=2 (low)features=30000012 config_len=10 num queues: 3
virtio_balloon_post_load_device
virtio_load: i=0 lai: 0 used: 53236 inuse: 12300 num: 128
hmp_loadvm exit

Good case: 
virtio_load: status=f queue_sel=2 (low)features=30000002 config_len=8 num queues: 3
virtio_balloon_post_load_device
virtio_load: i=0 lai: 0 used: 0 inuse: 0 num: 128
virtio_load: i=1 lai: 0 used: 0 inuse: 0 num: 128
virtio_load: i=2 lai: 1 used: 0 inuse: 1 num: 128

--- Additional comment from Dr. David Alan Gilbert on 2021-09-13 17:24:00 BST ---

The source and dest disagree over the gpa of the usedrings:

source:
virtio_init_region_cache: vdev=0x48c6270 n=0 vq=0x7f300cc48010 descaddr: 1145af000 usedaddr: 1145af940 usedsize: 406
virtio_init_region_cache: vdev=0x48c6270 n=1 vq=0x7f300cc480b0 descaddr: 10db19000 usedaddr: 10db19940 usedsize: 406
virtio_init_region_cache: vdev=0x48c6270 n=2 vq=0x7f300cc48150 descaddr: 10db1a000 usedaddr: 10db1a940 usedsize: 406

dest:
virtio_init_region_cache: vdev=0x48c6270 n=0 vq=0x7f300cc48010 descaddr: 1145af000 usedaddr: 1145b0000 usedsize: 406

which I guess explains why it's loading junk from it.

--- Additional comment from Dr. David Alan Gilbert on 2021-09-13 18:15:17 BST ---

The dest is confused about flags; it thinks it's got a non-V1 device, and:

            /*
             * VIRTIO-1 devices migrate desc, used, and avail ring addresses so
             * only the region cache needs to be set up.  Legacy devices need
             * to calculate used and avail ring addresses based on the desc
             * address.
             */
            if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
                virtio_init_region_cache(vdev, i);
            } else {
                virtio_queue_update_rings(vdev, i);
            }

so ends up doing update_rings which has an alignment in hence gets a different address.

virtio_load: features64: 30000012
virtio_load: V1=0
virtio_save: V1=1

--- Additional comment from John Ferlan on 2021-09-13 19:47:54 BST ---

Parking this with Dave for now w/ Triaged set since there's quite a bit that has gone on.

--- Additional comment from Dr. David Alan Gilbert on 2021-09-13 20:05:06 BST ---

I'm trying to find where our v1 bit went.
It's confusing.

We do suspect there's a config write from the guest happening before it's done a read - but hmm.

virtio_64bit_features_needed is connected to host_features having a high bit set - however it's used
to select whether to store *guest* features high bits.

On the source I'm seeing:
virtio_save (2952): V1=1 guest_features=130000012
virtio_64bit_features_needed: host_features=179000012

so you'd expect it to be saving the extra bits; but on the dest I'm seeing :

virtio_load: vdev=0x46ff8f0 status=f queue_sel=2 (low)features=30000012 config_len=10 num queues: 3
virtio_balloon_post_load_device
virtio_64bit_features_needed: host_features=179000012
virtio_load: features64: 30000012
virtio_set_features_nocheck: val=30000012 host_features=179000012
virtio_load: V1=0

that suggests that the dest isn't seeing the section.
(And the analyze script can't analyse virtio sections, so I can't easily see if the section is there)

--- Additional comment from Dr. David Alan Gilbert on 2021-09-14 13:49:52 BST ---

Got it; this is a very simple typo
@@ -852,7 +873,7 @@ static const VMStateDescription vmstate_virtio_balloon_free_page_hint = {
 };
 
 static const VMStateDescription vmstate_virtio_balloon_page_poison = {
-    .name = "vitio-balloon-device/page-poison",
+    .name = "virtio-balloon-device/page-poison",
     .version_id = 1,
     .minimum_version_id = 1,
     .needed = virtio_balloon_page_poison_support,

the problem is that this completely screws up all the rest of the virtio state loading for virtio-balloon; so we lose
at least things like the host_feature bits, and possibly a lot of other state.
This explains why sometimes it would cause the end-of-section canary to fail - but it doesn't always,
and I'm not sure why.

That bug has been in there since 5.1; so we might need to backport this quite a way to make
migration safe with 9.0 guests.

--- Additional comment from Dr. David Alan Gilbert on 2021-09-14 13:50:24 BST ---

Moving to assigned, problem understood.

--- Additional comment from Dr. David Alan Gilbert on 2021-09-14 14:18:14 BST ---

posted upstream:
0001-virtio-balloon-Fix-page-poison-subsection-name.patch

--- Additional comment from Dr. David Alan Gilbert on 2021-09-14 14:49:59 BST ---

Hi Li,
  I think this bug probably happens if you try it on a rhel8 or rhel8-av host with a rhel9 guest.
Can you please try it on rhel8-av 8.6, 8.5 and 8.4, 8.3 - I believe 8.2 should work OK.

Comment 1 John Ferlan 2021-09-15 14:07:52 UTC
Upstream patch/discussion: https://lists.nongnu.org/archive/html/qemu-devel/2021-09/msg03682.html

Will clone for AV 8.5.0 to ensure this gets fixed there too.

Comment 2 Yumei Huang 2021-09-28 03:25:01 UTC
Reproduced with a rhel8.5 guest (adding the page_poison=1 kernel option) and the q35 machine type.

qemu-kvm-6.0.0-29.module+el8.5.0+12386+43574bac
guest kernel: 4.18.0-332.el8.x86_64
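For reference, one way to add that kernel option persistently inside the guest is grubby (a sketch; it assumes grubby is installed in the guest):

# grubby --update-kernel=ALL --args="page_poison=1"    # then reboot the guest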

Comment 3 Dr. David Alan Gilbert 2021-10-06 10:18:41 UTC
This is now merged upstream in 243a9284a989a38e32ce, so 8.6 will get it when it does the 6.2 rebase.
-> POST
(I have done a c9s backport for bz 1984401)

Comment 10 John Ferlan 2021-12-22 18:01:48 UTC
Mass update of DTM/ITM to +3 values since the rebase of qemu-6.2 into RHEL 8.6 has been delayed or slowed due to process roadblocks (authentication changes, gating issues). This avoids the DevMissed bot and, worse, the bot that could come along and strip release+. The +3 was chosen mainly to give a cushion.

Also added the qemu-6.2 rebase bug 2027716 as a dependent.

Comment 13 Yanan Fu 2021-12-24 02:48:48 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 14 Mario Casquero 2021-12-28 08:02:20 UTC
Test steps:

1. Boot rhel8.6 guest with balloon device under q35, see the full cmdline below.

2. Add 'page_poison=1' to guest kernel line, and reboot guest.

3. Create internal snapshot and revert it.
(qemu) savevm sn0
(qemu) loadvm sn0

Reproduce:

qemu-kvm-6.1.0-5.module+el8.6.0+13430+8fdd5f85
host kernel: 4.18.0-355.el8.x86_64
guest kernel: 4.18.0-356.el8.x86_64

Loadvm hit error as below,

(qemu) loadvm sn0
VQ 1 size 0x80 < last_avail_idx 0x0 - used_idx 0x32b0
Failed to load virtio-balloon:virtio
error while loading state for instance 0x0 of device '0000:00:05.0:00.0/virtio-balloon'
Error: Error -1 while loading VM state


Verify:
qemu-kvm-6.2.0-1.module+el8.6.0+13725+61ae1949
host kernel: 4.18.0-355.el8.x86_64
guest kernel: 4.18.0-356.el8.x86_64

The issue is gone, so it's been fixed.

Full qemu-kvm command:

/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1' \
    -machine q35 \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x1  \
    -device pvpanic,ioport=0x505,id=id5SK4co  \
    -device pcie-root-port,id=pcie.0-root-port-2,slot=2,chassis=2,addr=0x2,bus=pcie.0 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie.0,addr=0x3 \
    -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel860-64-virtio-scsi.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device pcie-root-port,id=pcie.0-root-port-4,slot=4,chassis=4,addr=0x4,bus=pcie.0 \
    -device virtio-net-pci,mac=9a:41:3a:3b:3c:3d,id=idzyzw7g,vectors=4,netdev=idhia6GM,bus=pcie.0-root-port-4,addr=0x0  \
    -netdev tap,id=idhia6GM \
    -smp 8 \
    -m 8G,maxmem=80G,slots=20 \
    -cpu host \
    -vnc :1  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm \
    -device pcie-root-port,id=pcie.0-root-port-5,slot=5,chassis=5,addr=0x5,bus=pcie.0 \
    -device virtio-balloon-pci,id=balloon0,bus=pcie.0-root-port-5 \
    -monitor stdio \
    -serial tcp:0:1234,server,nowait \
    -monitor unix:/tmp/monitor3,server,nowait \
    -qmp tcp:0:3333,server,nowait \

Comment 17 errata-xmlrpc 2022-05-10 13:21:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1759

