Bug 1460962 - vm cannot be started if it has a corrupted managedsave file
Summary: vm cannot be started if it has a corrupted managedsave file
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.4
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Jiri Denemark
QA Contact: Yanqiu Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-13 09:15 UTC by yisun
Modified: 2018-04-10 10:50 UTC
CC List: 10 users

Fixed In Version: libvirt-3.9.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-10 10:48:37 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHEA-2018:0704 (Last Updated: 2018-04-10 10:50:02 UTC)

Description yisun 2017-06-13 09:15:03 UTC
Description: vm cannot be started if it has a corrupted managedsave file

Versions:
libvirt-3.2.0-9.el7.x86_64

PLEASE NOTE: THIS IS A REGRESSION ISSUE, and it is NOT reproduced on:
libvirt-3.2.0-7.el7.x86_64
libvirt-2.0.0-10.el7_3.9.x86_64


How reproducible:
100%



Steps:
1. Have a shut-off VM:
# virsh list
 Id    Name                           State
----------------------------------------------------
 8     avocado-vt-vm1                 running

# virsh destroy avocado-vt-vm1
Domain avocado-vt-vm1 destroyed

2. Create an invalid managedsave file for this VM:
# touch /var/lib/libvirt/qemu/save/avocado-vt-vm1.save

3. Start the VM; an error occurs:
# virsh start avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: An error occurred, but the cause is unknown

Now, if we start the VM again, it starts successfully. This seems to be because the corrupted managed save file was removed by libvirt:
# ll /var/lib/libvirt/qemu/save/avocado-vt-vm1.save
ls: cannot access /var/lib/libvirt/qemu/save/avocado-vt-vm1.save: No such file or directory

# virsh start avocado-vt-vm1
Domain avocado-vt-vm1 started


Actual result:
Step 3 fails with an ambiguous error message.

Expected result:
Step 3 should succeed, as it does with previous versions of libvirt.
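
For context, the expected behavior can be modeled with a minimal C sketch (an illustration with invented names, not libvirt source): the restore helper reports a corrupted image with a special return value, and the caller warns, deletes the file, and falls back to a cold boot.

/* sketch.c: illustrative model of the expected start logic; all names
 * are invented and this is NOT libvirt source code. */
#include <stdio.h>
#include <unistd.h>

typedef struct {
    const char *name;
} Domain;

/* Returns 0 on success, -1 on an ordinary error, and -3 if the managed
 * save image exists but is corrupted. */
static int
restoreFromManagedSave(Domain *dom, const char *path)
{
    (void)dom;
    (void)path;
    return -3;                 /* pretend header validation failed */
}

static int
bootFresh(Domain *dom)
{
    printf("Domain %s started\n", dom->name);
    return 0;
}

static int
startDomain(Domain *dom, const char *savePath)
{
    int ret = restoreFromManagedSave(dom, savePath);

    if (ret == -3) {
        /* Corrupted image: warn, remove it, and boot from scratch. */
        fprintf(stderr,
                "Unable to restore from managed state %s. "
                "Maybe the file is corrupted?\n", savePath);
        unlink(savePath);
        ret = bootFresh(dom);
    }
    return ret;
}

int main(void)
{
    Domain dom = { "avocado-vt-vm1" };
    return startDomain(&dom,
                       "/var/lib/libvirt/qemu/save/avocado-vt-vm1.save");
}

The regression analyzed in comment 3 breaks exactly this -3 signalling, so the fallback branch is never reached.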


Additional info:
1. libvirtd log when the VM failed to start:
...
2017-06-13 03:47:34.736+0000: 11284: debug : qemuDomainObjBeginJobInternal:3787 : Started async job: start (vm=0x7ff8c02c4ca0 name=avocado-vt-vm1)
2017-06-13 03:47:34.736+0000: 11280: debug : virEventPollCalculateTimeout:359 : Schedule timeout then=1497325659736 now=1497325654736
2017-06-13 03:47:34.736+0000: 11284: info : virObjectRef:296 : OBJECT_REF: obj=0x7ff8c0194ac0
2017-06-13 03:47:34.736+0000: 11280: debug : virEventPollCalculateTimeout:369 : Timeout at 1497325659736 due in 5000 ms
2017-06-13 03:47:34.736+0000: 11284: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7ff8c0194ac0
2017-06-13 03:47:34.736+0000: 11280: info : virEventPollRunOnce:640 : EVENT_POLL_RUN: nhandles=10 timeout=5000
2017-06-13 03:47:34.736+0000: 11284: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7ff8c0194ac0
2017-06-13 03:47:34.736+0000: 11284: info : virObjectRef:296 : OBJECT_REF: obj=0x7ff8c0194ac0
2017-06-13 03:47:34.736+0000: 11284: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7ff8c0194ac0
2017-06-13 03:47:34.736+0000: 11284: info : virObjectRef:296 : OBJECT_REF: obj=0x7ff8c01fbf20
2017-06-13 03:47:34.736+0000: 11284: info : virObjectRef:296 : OBJECT_REF: obj=0x7ff8c0194ac0
2017-06-13 03:47:34.736+0000: 11284: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7ff8c0194ac0
2017-06-13 03:47:34.736+0000: 11284: debug : virFileIsSharedFSType:3391 : Check if path /var/lib/libvirt/qemu/save/avocado-vt-vm1.save with FS magic 1481003842 is shared
2017-06-13 03:47:34.736+0000: 11284: debug : virFileClose:110 : Closed fd 23
2017-06-13 03:47:34.736+0000: 11284: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7ff8c01fbf20
2017-06-13 03:47:34.736+0000: 11284: warning : qemuDomainObjStart:7112 : Unable to restore from managed state /var/lib/libvirt/qemu/save/avocado-vt-vm1.save. Maybe the file is corrupted?
2017-06-13 03:47:34.736+0000: 11284: debug : qemuDomainObjEndAsyncJob:3955 : Stopping async job: start (vm=0x7ff8c02c4ca0 name=avocado-vt-vm1)
2017-06-13 03:47:34.736+0000: 11284: info : virObjectRef:296 : OBJECT_REF: obj=0x7ff8c0194ac0
2017-06-13 03:47:34.736+0000: 11284: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7ff8c0194ac0
2017-06-13 03:47:34.736+0000: 11284: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7ff8c02c4ca0
2017-06-13 03:47:34.736+0000: 11284: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7ff90c002da0
2017-06-13 03:47:34.736+0000: 11284: info : virObjectUnref:261 : OBJECT_DISPOSE: obj=0x7ff90c002da0
2017-06-13 03:47:34.736+0000: 11284: debug : virDomainDispose:316 : release domain 0x7ff90c002da0 avocado-vt-vm1 19b984c1-07ec-43b6-a253-a2b3e23ad476
...



2. In step 3, the VM starts successfully on libvirt-3.2.0-7.el7.x86_64 and libvirt-2.0.0-10.el7_3.9.x86_64.

Comment 3 Jiri Denemark 2017-06-13 11:23:06 UTC
Oops, caused by

commit ac793bd7195ab99445cf6c6d6053439c56cef922
Author:     Jiri Denemark <jdenemar>
AuthorDate: Tue Jun 6 22:27:57 2017 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Wed Jun 7 13:36:01 2017 +0200

    qemu: Fix memory leaks in qemuDomainSaveImageOpen
    
    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Pavel Hrdina <phrdina>

which switched from directly returning -3 to a goto, but failed to change the "return -1" statement at the end of the error path.
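
Schematically, the regression looks like this; a sketch of the pattern only, with invented names, not the actual libvirt diff:

/* Stand-in for the real header validation; always "corrupted" here. */
static int headerIsCorrupted(void) { return 1; }

/* Before the leak fix: the special value is returned directly. */
static int openImageBefore(void)
{
    if (headerIsCorrupted())
        return -3;             /* caller treats -3 as "corrupted image" */
    return 0;
}

/* After the leak fix: the early return became "goto error", but the
 * shared error path still ends in "return -1", so the -3 is lost and
 * the caller sees only a generic failure. */
static int openImageAfter(void)
{
    if (headerIsCorrupted())
        goto error;            /* BUG: -3 is no longer propagated */
    return 0;

 error:
    /* ... free buffers, close file descriptors ... */
    return -1;
}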

Comment 4 Jiri Denemark 2017-06-13 11:34:15 UTC
Patch sent upstream for review: https://www.redhat.com/archives/libvir-list/2017-June/msg00541.html

Comment 5 Jiri Denemark 2017-06-13 11:56:17 UTC
Fixed upstream now by

commit 16e31fb38da3c2b9a35faff9ac626d947199cf13
Refs: v3.4.0-97-g16e31fb38
Author:     Jiri Denemark <jdenemar>
AuthorDate: Tue Jun 13 13:25:07 2017 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue Jun 13 13:46:40 2017 +0200

    qemu: Fix starting a domain with corrupted managed save file

    Commit v3.4.0-44-gac793bd71 fixed a memory leak, but failed to return
    the special -3 value. Thus an attempt to start a domain with a corrupted
    managed save file would remove the corrupted file and report
    "An error occurred, but the cause is unknown" instead of starting the
    domain from scratch.

    https://bugzilla.redhat.com/show_bug.cgi?id=1460962
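
The shape of the fix, as a simplified illustration (invented names, not the actual commit): carry the special value in a ret variable and return it through the shared cleanup label.

/* Stand-in for the real header validation; always "corrupted" here. */
static int headerIsCorrupted(void) { return 1; }

static int openImageFixed(void)
{
    int ret = -1;

    if (headerIsCorrupted()) {
        ret = -3;              /* preserve the "corrupted image" signal */
        goto cleanup;
    }
    ret = 0;

 cleanup:
    /* ... free buffers, close file descriptors ... */
    return ret;
}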

Comment 6 yisun 2017-06-29 07:18:08 UTC
Hit another issue that should have the same root cause; documenting it here. Jiri, please help to confirm.

Summary: cannot undefine a VM when it used to have a corrupted managedsave file which has already been removed

Steps:
1. Make a managedsave:
root@localhost ~ # virsh managedsave avocado-vt-vm1

Domain avocado-vt-vm1 state saved by libvirt


2. Corrupt the managedsave file:
root@localhost ~ # echo > /var/lib/libvirt/qemu/save/avocado-vt-vm1.save

3. Try to start the VM:
root@localhost ~ # virsh start avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: An error occurred, but the cause is unknown

4. Now we can see the managedsave file has been removed:
root@localhost ~ # ll /var/lib/libvirt/qemu/save/avocado-vt-vm1.save
ls: cannot access /var/lib/libvirt/qemu/save/avocado-vt-vm1.save: No such file or directory

5. Try to undefine the VM:
root@localhost ~ # virsh undefine avocado-vt-vm1
error: Refusing to undefine while domain managed save image exists
<=== now we cannot undefine the VM

Comment 7 Jiri Denemark 2017-06-29 08:34:23 UTC
Yeah, it's caused by the same bug. Libvirtd still thinks the domain has a saved state: the corrupted file was removed, but the cached state was never updated. Restarting libvirtd should let you undefine the domain.
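
The refusal in step 5 is consistent with an undefine path that checks a cached flag rather than re-checking the disk. A paraphrased sketch (invented names, not libvirt source):

#include <stdio.h>

typedef struct {
    const char *name;
    int hasManagedSave;        /* cached; goes stale if the save file
                                * disappears behind the daemon's back */
} Domain;

static int undefineDomain(Domain *dom)
{
    if (dom->hasManagedSave) {
        fprintf(stderr, "Refusing to undefine while domain managed "
                        "save image exists\n");
        return -1;
    }
    /* ... remove the persistent domain configuration ... */
    return 0;
}

Because the failed start deleted the corrupted file without clearing the cached flag, the guard keeps firing until libvirtd is restarted and rebuilds its view of the domain.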

Comment 9 Yanqiu Zhang 2017-10-24 06:42:24 UTC
Reproduced this bug with libvirt-3.2.0-14.el7_4.2.x86_64.

Steps to reproduce:
1. # virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              shut off

# ls /var/lib/libvirt/qemu/save/
# echo > /var/lib/libvirt/qemu/save/V.save
# virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              shut off

# virsh start V
error: Failed to start domain V
error: An error occurred, but the cause is unknown  <== Reproduced

# virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              shut off

2. # virsh start V
Domain V started

# virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 206   V                              running

# virsh managedsave V

Domain V state saved by libvirt

# virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              saved

# echo > /var/lib/libvirt/qemu/save/V.save
# virsh start V
error: Failed to start domain V
error: An error occurred, but the cause is unknown  <== Reproduced

# virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              saved

# virsh start V
Domain V started

# virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 207   V                              running

# virsh destroy V
Domain V destroyed
# virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              saved




Verify this bug with libvirt-3.8.0-1.el7.x86_64.

Steps to verify:
1. # virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              shut off

# ls /var/lib/libvirt/qemu/save/
# echo > /var/lib/libvirt/qemu/save/V.save
#  virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              shut off

#  virsh start V
Domain V started             <== Successfully started without error.

2. # virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 1     V                              running

# virsh managedsave V

Domain V state saved by libvirt

#  virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              saved

#  echo > /var/lib/libvirt/qemu/save/V.save
#  virsh start V
Domain V started             <== Successfully started without error.

#  virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 2     V                              running

# virsh destroy V
Domain V destroyed

#  virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              saved

# virsh undefine V
error: Refusing to undefine while domain managed save image exists

# systemctl restart libvirtd

#  virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              shut off

# virsh undefine V
Domain V has been undefined


The 'start' behavior above gets the expected result.

But, Jiri, one more question:
In the last few steps, after starting the guest with a corrupted image, the managed-save status can only be cleared by restarting libvirtd; even if I start/destroy the guest many times, it is not cleared. Do you think that's okay?

Comment 10 Jiri Denemark 2017-10-24 08:30:45 UTC
Oops, it looks like we don't reset the managed-save status after deleting a corrupted save image. An additional trivial patch is needed...

Comment 11 Jiri Denemark 2017-10-24 08:40:49 UTC
Patch sent upstream for review: https://www.redhat.com/archives/libvir-list/2017-October/msg01079.html

Comment 12 Jiri Denemark 2017-10-24 09:13:07 UTC
Fixed upstream by

commit f26636887fee11b3ecaa5c0a0734687cded8ed28
Refs: v3.8.0-237-gf26636887
Author:     Jiri Denemark <jdenemar>
AuthorDate: Tue Oct 24 10:32:03 2017 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue Oct 24 11:07:10 2017 +0200

    qemu: Reset hasManagedSave after removing a corrupted image

    When starting a domain with managed save image, we try to restore it
    first. If the image is corrupted, we silently unlink it and just
    normally start the domain. At this point the domain has no managed save
    image, yet we did not reset the hasManagedSave flag.

    https://bugzilla.redhat.com/show_bug.cgi?id=1460962

    Signed-off-by: Jiri Denemark <jdenemar>
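
In terms of the start sketch in the bug description, the follow-up fix adds one assignment: clear the cached flag once the corrupted image has actually been unlinked. A simplified illustration (invented names, not the actual libvirt change):

#include <stdio.h>
#include <unistd.h>

typedef struct {
    const char *name;
    int hasManagedSave;        /* cached managed-save indicator */
} Domain;

static int
handleCorruptedImage(Domain *dom, const char *savePath)
{
    fprintf(stderr,
            "Unable to restore from managed state %s. "
            "Maybe the file is corrupted?\n", savePath);
    if (unlink(savePath) == 0)
        dom->hasManagedSave = 0;   /* the reset the first fix lacked */
    return 0;                      /* proceed with a cold boot */
}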

Comment 13 Yanqiu Zhang 2017-11-22 08:34:29 UTC
Verify this bug with libvirt-3.9.0-2.el7.x86_64:

1. Newly create a corrupted save image:
# virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              shut off

# ls /var/lib/libvirt/qemu/save/
# echo >  /var/lib/libvirt/qemu/save/V.save

# virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              shut off

# virsh start V
Domain V started

# ls /var/lib/libvirt/qemu/save/V.save

# virsh destroy V
Domain V destroyed

# virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              shut off  <== status is not "saved"

2. Corrupt an existing save image:
# virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 6     V                              running

# virsh managedsave V

Domain V state saved by libvirt

# virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              saved

# echo >  /var/lib/libvirt/qemu/save/V.save

# virsh start V
Domain V started

# virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 7     V                              running

# virsh destroy V
Domain V destroyed

# virsh list --all --managed-save
 Id    Name                           State
----------------------------------------------------
 -     V                              shut off  <== status is not "saved"

And the guest can be undefined.


According to comment 9 and this comment, marking this bug as verified.

Comment 17 errata-xmlrpc 2018-04-10 10:48:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0704

