Bug 1531888 - Local VM and migrated VM on the same host can run with the same RAW file as virtual disk source without shareable configured or a lock manager enabled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.5
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Kevin Wolf
QA Contact: CongLi
URL:
Whiteboard:
Depends On:
Blocks: 1673010 1673014
 
Reported: 2018-01-06 12:28 UTC by jiyan
Modified: 2019-08-22 09:19 UTC
CC List: 20 users

Fixed In Version: qemu-kvm-rhev-2.12.0-22.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1673010 1673014 (view as bug list)
Environment:
Last Closed: 2019-08-22 09:18:46 UTC
Target Upstream Version:


Attachments
Libvirtd log for migration step (629.45 KB, text/plain)
2019-01-30 07:59 UTC, jiyan


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2019:2553 None None None 2019-08-22 09:19:41 UTC

Description jiyan 2018-01-06 12:28:36 UTC
Description of problem:
VM1 runs on host1 (RHEL 7.5, no lock manager configured) and VM2 runs on host2 (RHEL 7.5, no lock manager configured), both using the same RAW source file. Since neither VM is configured as shareable, migrating VM1 to host2 should fail, but it succeeds.

Version-Release number of selected component (if applicable):
libvirt-3.9.0-7.el7.x86_64
qemu-kvm-rhev-2.10.0-14.el7.x86_64
kernel-3.10.0-823.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
Step1: Prepare two hosts, each with a VM running on it.
host1: # virsh domstate test
shut off

host1: # virsh dumpxml test |grep "<disk" -A5
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/RHEL-7.5-x86_64-latest.img'/>
      <target dev='hda' bus='ide'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>

host1: # virsh start test
Domain test started

host2: # virsh domstate test1
shut off

host2: # virsh dumpxml test1 |grep "<disk" -A5
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/RHEL-7.5-x86_64-latest.img'/>
      <target dev='hda' bus='ide'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>

host2:# virsh start test1
Domain test1 started

Step2: Migrate test from host1 to host2
host1: # virsh migrate --live test qemu+ssh://10.73.72.81/system --verbose
root@10.73.72.81's password: 
Migration: [100 %]

Step3: Check status of VM in host2
host2: # virsh list 
 Id    Name                           State
----------------------------------------------------
 26    test1                          running
 27    test                           running

host2: # virsh dumpxml test1 |grep "<disk" -A7;virsh dumpxml test |grep "<disk" -A7
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/RHEL-7.5-x86_64-latest.img'/>
      <backingStore/>
      <target dev='hda' bus='ide'/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/RHEL-7.5-x86_64-latest.img'/>
      <backingStore/>
      <target dev='hda' bus='ide'/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>

Actual results:
As step3 shows


Expected results:
An error such as 'Failed to get "write" lock' should be raised during migration.
https://bugzilla.redhat.com/show_bug.cgi?id=1511480#c6

Additional info:

Comment 2 Peter Krempa 2018-01-08 09:00:57 UTC
This is a bug in the qemu locking code, so I'm reassigning this to qemu-kvm. 

Also, this might be impossible to solve, since qemu can't know whether the image is locked by the original source or by another VM. If qemu can't solve this, please close this bug as CANTFIX, since libvirt has its own locking.

Comment 5 Fam Zheng 2018-01-29 12:24:23 UTC
Looking at the stderr of QEMU, there _is_ an error like:

"Failed to get "write" lock
Is another process using the image?"

However, it is not reported to the monitor, because the failure happens at the end of migration and we currently don't have a channel to report it to libvirt correctly; nor does the migration process abort/cancel due to the locking failure.

The fix would be to change the destination QEMU to fail the migration if image locking fails.

Comment 6 Peter Krempa 2018-01-29 12:51:17 UTC
Well, libvirt only ever looks at the stderr output if the qemu process exits before the monitor is up. After that, the error should be reported on the monitor, either by failing the migration (which is properly wired up in libvirt's migration code) or by reporting an event along with, obviously, stopping the CPUs.

Comment 8 Markus Armbruster 2019-01-28 09:13:44 UTC
May I have your libvirt logs, please?

Comment 10 jiyan 2019-01-30 07:59:05 UTC
Created attachment 1524938 [details]
Libvirtd log for migration step

Comment 11 Markus Armbruster 2019-01-30 18:15:42 UTC
Alright, let's simplify the reproducer.

0. Pick a raw test image.  Any image should do.  Mine is called test.img.

1. Run a qemu-kvm using that image:

   $ qemu-kvm -nodefaults -S -display none -drive if=none,id=disk0,format=raw,file=test.img -device virtio-blk-pci,drive=disk0

   Starts successfully.

2. In a second terminal, run another qemu-kvm using the same image:

   $ qemu-kvm -nodefaults -S -display none -drive if=none,id=disk0,format=raw,file=test.img -device virtio-blk-pci,drive=disk0
   qemu-kvm: -device virtio-blk-pci,drive=disk0: Failed to get "write" lock
   Is another process using the image?

   Fails as expected.

3. Same, but add -incoming defer:

   $ qemu-kvm -nodefaults -S -display none -drive if=none,id=disk0,format=raw,file=test.img -device virtio-blk-pci,drive=disk0 -incoming defer

   Starts successfully.

   So, -incoming defer somehow suppresses the error, and that is how libvirt starts qemu-kvm on the destination end of the migration, as the logs from comment#10 show.

   I tracked this down to blk_do_attach_dev():

    /* While migration is still incoming, we don't need to apply the
     * permissions of guest device BlockBackends. We might still have a block
     * job or NBD server writing to the image for storage migration. */
    if (runstate_check(RUN_STATE_INMIGRATE)) {
        blk->disable_perm = true;
    }

   This is from commit d35ff5e6b3aa3a706b0aa3bcf11400fac945b67a "block: Ignore guest dev permissions during incoming migration".

4. Kill the second qemu-kvm, restart it with a monitor, then make it await incoming migration

   $ qemu-kvm -nodefaults -S -display none -drive if=none,id=disk0,format=raw,file=test.img -device virtio-blk-pci,drive=disk0 -incoming defer -monitor stdio
   QEMU 2.12.0 monitor - type 'help' for more information
   (qemu) migrate_incoming unix:test-mig
   (qemu) 

5. In a third terminal, copy the image, start a third qemu-kvm, then migrate it to the second one

   $ cp test.img copy-of-test.img
   $ qemu-kvm -nodefaults -S -display none -drive if=none,id=disk0,format=raw,file=copy-of-test.img -device virtio-blk-pci,drive=disk0 -monitor stdio
   QEMU 2.12.0 monitor - type 'help' for more information
   (qemu) migrate unix:test-mig
   (qemu) 

   Apparently succeeds, but the second qemu-kvm (in the second terminal) gripes:

   rhev7-qemu-kvm: Failed to get "write" lock
   Is another process using the image?

   Matches what Fam observed in comment#5.

Comment 12 Markus Armbruster 2019-01-31 13:40:14 UTC
Slightly refined simplified reproducer:

= Terminal 1 =

    $ qemu-kvm -nodefaults -display none -drive if=none,id=disk0,format=raw,file=test.img -device virtio-blk-pci,drive=disk0 -monitor stdio
    QEMU 2.12.0 monitor - type 'help' for more information
    (qemu) info block
    disk0 (#block122): test.img (raw)
	Attached to:      /machine/peripheral-anon/device[0]/virtio-backend
	Cache mode:       writethrough
    (qemu) info status
    VM status: running
    (qemu) 

Nothing interesting's going to happen in this terminal from now on.  Its
sole purpose is holding a lock on test.img.

= Terminal 2 =

    $ qemu-kvm -nodefaults -display none -drive if=none,id=disk0,format=raw,file=test.img -device virtio-blk-pci,drive=disk0 -incoming unix:test-mig -monitor stdio
    QEMU 2.12.0 monitor - type 'help' for more information
    (qemu) info status
    VM status: paused (inmigrate)
    (qemu) 

= Terminal 3 =

    $ qemu-kvm -nodefaults -display none -drive if=none,id=disk0,format=raw,file=copy-of-test.img -device virtio-blk-pci,drive=disk0 -monitor stdio
    QEMU 2.12.0 monitor - type 'help' for more information
    (qemu) info status
    VM status: running
    (qemu) migrate unix:test-mig

The command completes after a small delay:

    (qemu) 

= Terminal 2 =

Locking fails:

    qemu-kvm: Failed to get "write" lock
    Is another process using the image [test.img]?

Checking status:

    info status
    VM status: paused
    (qemu) 
= Terminal 3 =

    info status
    VM status: paused (postmigrate)
    (qemu) 

= Terminal 2 =

Continue the stopped migrated guest, and check its status:

    cont
    (qemu) info status
    VM status: running
    (qemu) info block
    disk0 (#block130): test.img (raw)
	Attached to:      /machine/peripheral-anon/device[0]/virtio-backend
	Cache mode:       writethrough
    (qemu) 

I now have two guests running, and both use test.img.  Oops.

Upstream behaves the same.

I enlisted Kevin Wolf's and Dave Gilbert's assistance.  They pointed out that with QCOW2 rather than raw images, the "cont" fails, and the guest remains stopped.

Armed with that information, Kevin went looking for the reason, and found a bug on an error path that isn't taken with QCOW2.  He volunteered to take this bug.

Thanks, guys!

Comment 15 Miroslav Rezanina 2019-02-05 17:08:08 UTC
Fix included in qemu-kvm-rhev-2.12.0-22.el7

Comment 16 CongLi 2019-02-19 09:46:47 UTC
Reproduced on qemu-kvm-rhev-2.12.0-18.el7_6.1.x86_64 with steps in comment 12:
Terminal 1:
(qemu) info status
VM status: running
Terminal 2:
(qemu) c
Failed to get "write" lock
Is another process using the image?
**after migration**
(qemu) info status
VM status: running
Terminal 3:
(qemu) info status
VM status: paused (postmigrate)

Verified this bug on qemu-kvm-rhev-2.12.0-23.el7.x86_64:
Terminal 1:
(qemu) info status
VM status: running
Terminal 2:
(qemu) c
Failed to get "write" lock
Is another process using the image [/home/kvm_autotest_root/images/rhel76-64-virtio-scsi.raw]?
**after migration**
(qemu) qemu-kvm: Failed to get "write" lock
Is another process using the image [/home/kvm_autotest_root/images/rhel76-64-virtio-scsi.raw]?
(qemu) info status
VM status: paused
(qemu) c
Failed to get "write" lock
Is another process using the image [/home/kvm_autotest_root/images/rhel76-64-virtio-scsi.raw]?
Terminal 3:
(qemu) info status
VM status: paused (postmigrate)

Thanks.

Comment 18 CongLi 2019-03-04 07:36:20 UTC
Verified this bug per comment 16.

Thanks.

Comment 20 errata-xmlrpc 2019-08-22 09:18:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2553

