Red Hat Bugzilla – Bug 903248
vdsm: cannot run a vm with several snapshots
Last modified: 2016-08-12 06:00:06 EDT
Created attachment 685972 [details]
logs

Description of problem:
I created a VM with several snapshots, and when I try to run it, it fails with the following error:

qemu-kvm: -drive file=/rhev/data-center/cee3603b-2308-4973-97a8-480f7d6d2132/c6443d8b-14e3-425d-9d3f-2e59820ea3e5/images/0cccaded-3aec-45b9-a24e-1a91d9fe968c/9b7d1880-9be1-4661-b194-4402ae1ac745,if=none,id=drive-virtio-disk0,format=qcow2,serial=0cccaded-3aec-45b9-a24e-1a91d9fe968c,cache=none,werror=stop,rerror=stop,aio=native: could not open disk image /rhev/data-center/cee3603b-2308-4973-97a8-480f7d6d2132/c6443d8b-14e3-425d-9d3f-2e59820ea3e5/images/0cccaded-3aec-45b9-a24e-1a91d9fe968c/9b7d1880-9be1-4661-b194-4402ae1ac745: Operation not permitted

Version-Release number of selected component (if applicable):
sf3
vdsm-4.10.2-3.0.el6ev.x86_64
libvirt-0.10.2-16.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.348.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with several snapshots.
2. Try to run the VM.

Actual results:
The VM fails to start.

Expected results:
The VM should start.

Additional info:
Attaching all logs.

Thread-130836::ERROR::2013-01-23 12:31:30,294::vm::680::vm.Vm::(_startUnderlyingVm) vmId=`9a11037b-0546-4057-b98f-6014ef8b81d5`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 642, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/libvirtvm.py", line 1486, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2645, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: internal error process exited while connecting to monitor: char device redirected to /dev/pts/1
qemu-kvm: -drive file=/rhev/data-center/cee3603b-2308-4973-97a8-480f7d6d2132/c6443d8b-14e3-425d-9d3f-2e59820ea3e5/images/bb40470e-e8b2-4b13-9cec-cded6f27ac6c/b257f7a9-3782-4b4d-8bdb-615a9f2dadb9,if=none,id=drive-ide0-0-0,format=qcow2,serial=bb40470e-e8b2-4b13-9cec-cded6f27ac6c,cache=none,werror=stop,rerror=stop,aio=native: could not open disk image /rhev/data-center/cee3603b-2308-4973-97a8-480f7d6d2132/c6443d8b-14e3-425d-9d3f-2e59820ea3e5/images/bb40470e-e8b2-4b13-9cec-cded6f27ac6c/b257f7a9-3782-4b4d-8bdb-615a9f2dadb9: Operation not permitted
Thread-130836::DEBUG::2013-01-23 12:31:30,310::vm::1027::vm.Vm::(setDownStatus) vmId=`9a11037b-0546-4057-b98f-6014ef8b81d5`::Changed state to Down: internal error process exited while connecting to monitor: char device redirected to /dev/pts/1
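(For context when triaging similar reports: the backing-file name recorded in the snapshot volume can be checked directly with qemu-img. This is only a sketch, using the volume path from the qemu-kvm error above; the exact chain will differ per setup.)

# Inspect the qcow2 header of the failing volume; the "backing file:" line is
# expected to show a relative ../<image-group-uuid>/<parent-volume-uuid> name,
# which is what libvirt has to resolve before it can grant qemu access to the chain.
qemu-img info /rhev/data-center/cee3603b-2308-4973-97a8-480f7d6d2132/c6443d8b-14e3-425d-9d3f-2e59820ea3e5/images/0cccaded-3aec-45b9-a24e-1a91d9fe968c/9b7d1880-9be1-4661-b194-4402ae1ac745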
Fede, dup of bug 890956 ?
While testing https://bugzilla.redhat.com/show_bug.cgi?id=878041 I hit this issue again. The bug reproduces 100% with the scenario below:

1. Run an HA VM.
2. Create 3 live snapshots.
3. kill -9 the VM's pid.

Currently the VM fails to restart with the following error:

Thread-52157::ERROR::2013-02-04 11:15:57,005::vm::700::vm.Vm::(_startUnderlyingVm) vmId=`09256c11-7021-485d-92d3-37f3faba5689`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 662, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/libvirtvm.py", line 1518, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 104, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2645, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: internal error process exited while connecting to monitor: qemu-kvm: -drive file=/rhev/data-center/b08d859c-f654-4976-8c7c-e102d4c252fb/43ce579b-57d0-4201-91b9-f2436f925a50/images/4af22343-6df8-4c36-adc9-bca01e423e39/b63b475f-c08c-4be9-a577-2c6e262c9afe,if=none,id=drive-virtio-disk0,format=qcow2,serial=4af22343-6df8-4c36-adc9-bca01e423e39,cache=none,werror=stop,rerror=stop,aio=native: could not open disk image /rhev/data-center/b08d859c-f654-4976-8c7c-e102d4c252fb/43ce579b-57d0-4201-91b9-f2436f925a50/images/4af22343-6df8-4c36-adc9-bca01e423e39/b63b475f-c08c-4be9-a577-2c6e262c9afe: Operation not permitted

For QE: when verifying this fix, please also make sure to test HA VMs in the above scenario.
Dafna - does it happen on 3.1.3 (with a RHEL 6.4 based VDSM) ?
yes - it also reproduces in 3.1.3:

Thread-258923::ERROR::2013-02-04 12:43:28,184::vm::680::vm.Vm::(_startUnderlyingVm) vmId=`19eff98f-c99a-41d7-81de-d806690aaedc`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 642, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/libvirtvm.py", line 1472, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2645, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: internal error process exited while connecting to monitor: qemu-kvm: -drive file=/rhev/data-center/3653dcb1-d4ce-477f-acb7-4fe7116b7950/127a0092-fde3-454c-bae5-edc00aa136d4/images/dee76e92-3790-464e-940f-f17f8c15f629/f7530761-a03d-46fd-871d-4ff3575fbbbf,if=none,id=drive-virtio-disk0,format=qcow2,serial=dee76e92-3790-464e-940f-f17f8c15f629,cache=none,werror=stop,rerror=stop,aio=native: could not open disk image /rhev/data-center/3653dcb1-d4ce-477f-acb7-4fe7116b7950/127a0092-fde3-454c-bae5-edc00aa136d4/images/dee76e92-3790-464e-940f-f17f8c15f629/f7530761-a03d-46fd-871d-4ff3575fbbbf: Permission denied
Thread-258923::DEBUG::2013-02-04 12:43:28,192::vm::1027::vm.Vm::(setDownStatus) vmId=`19eff98f-c99a-41d7-81de-d806690aaedc`::Changed state to Down: internal error process exited while connecting to monitor: qemu-kvm: -drive file=/rhev/data-center/3653dcb1-d4ce-477f-acb7-4fe7116b7950/127a0092-fde3-454c-bae5-edc00aa136d4/images/dee76e92-3790-464e-940f-f17f8c15f629/f7530761-a03d-46fd-871d-4ff3575fbbbf,if=none,id=drive-virtio-disk0,format=qcow2,serial=dee76e92-3790-464e-940f-f17f8c15f629,cache=none,werror=stop,rerror=stop,aio=native: could not open disk image /rhev/data-center/3653dcb1-d4ce-477f-acb7-4fe7116b7950/127a0092-fde3-454c-bae5-edc00aa136d4/images/dee76e92-3790-464e-940f-f17f8c15f629/f7530761-a03d-46fd-871d-4ff3575fbbbf: Permission denied
Dafna, what version did it work in that it's marked as a regression?
I did not see a problem running the same with rhel6.3 but I am willing to recheck.
I ran the same test with rhel6.3 and the vm restarted with no issues. So yes, this is a regression: it worked with vdsm-4.9.6-44.3.el6_3.x86_64 on kernel 2.6.32-279.19.1.el6.x86_64.
My feeling is that there's something wrong in how libvirt is computing the relative path:

2013-02-04 15:42:57.618+0000: 21720: error : absolutePathFromBaseFile:560 : Can't canonicalize path '../4af22343-6df8-4c36-adc9-bca01e423e39/d2cae84a-8f59-414c-9040-d4df7b9141ad': No such file or directory
2013-02-04 15:42:57.618+0000: 21720: warning : virStorageFileGetMetadataFromBuf:736 : Backing file '../4af22343-6df8-4c36-adc9-bca01e423e39/d2cae84a-8f59-414c-9040-d4df7b9141ad' of image '/dev/dm-46' is missing.
2013-02-04 15:42:57.618+0000: 21720: debug : virFileClose:72 : Closed fd 26
2013-02-04 15:42:57.618+0000: 21720: debug : qemuSetupDiskPathAllow:71 : Process path /rhev/data-center/b08d859c-f654-4976-8c7c-e102d4c252fb/43ce579b-57d0-4201-91b9-f2436f925a50/images/4af22343-6df8-4c36-adc9-bca01e423e39/b63b475f-c08c-4be9-a577-2c6e262c9afe for disk

The image is present (also prior to starting the vm, so no race):

# ls -l /rhev/data-center/b08d859c-f654-4976-8c7c-e102d4c252fb/43ce579b-57d0-4201-91b9-f2436f925a50/images/4af22343-6df8-4c36-adc9-bca01e423e39
total 24
lrwxrwxrwx. 1 vdsm kvm 78 Feb 3 17:30 3d4da434-ccc7-486b-b29f-98d2651c2a1b -> /dev/43ce579b-57d0-4201-91b9-f2436f925a50/3d4da434-ccc7-486b-b29f-98d2651c2a1b
lrwxrwxrwx. 1 vdsm kvm 78 Feb 3 17:33 85e55d89-d790-4ef1-8ef3-45113a787a8e -> /dev/43ce579b-57d0-4201-91b9-f2436f925a50/85e55d89-d790-4ef1-8ef3-45113a787a8e
lrwxrwxrwx. 1 vdsm kvm 78 Feb 3 17:34 a5bea3ee-9f07-4253-bde2-106d15309feb -> /dev/43ce579b-57d0-4201-91b9-f2436f925a50/a5bea3ee-9f07-4253-bde2-106d15309feb
lrwxrwxrwx. 1 vdsm kvm 78 Feb 4 11:14 b63b475f-c08c-4be9-a577-2c6e262c9afe -> /dev/43ce579b-57d0-4201-91b9-f2436f925a50/b63b475f-c08c-4be9-a577-2c6e262c9afe
lrwxrwxrwx. 1 vdsm kvm 78 Feb 4 10:59 be749868-5666-4751-b449-fecc44eb228c -> /dev/43ce579b-57d0-4201-91b9-f2436f925a50/be749868-5666-4751-b449-fecc44eb228c
lrwxrwxrwx. 1 vdsm kvm 78 Feb 4 10:58 d2cae84a-8f59-414c-9040-d4df7b9141ad -> /dev/43ce579b-57d0-4201-91b9-f2436f925a50/d2cae84a-8f59-414c-9040-d4df7b9141ad

After that, the cgroup is probably not set up correctly and the qemu process fails with:

/rhev/data-center/b08d859c-f654-4976-8c7c-e102d4c252fb/43ce579b-57d0-4201-91b9-f2436f925a50/images/4af22343-6df8-4c36-adc9-bca01e423e39/b63b475f-c08c-4be9-a577-2c6e262c9afe: Operation not permitted

(I have already encountered EPERM associated with cgroups.)

There might also be something strange going on with LVM, as the LV a5bea3ee-9f07-4253-bde2-106d15309feb is reported as not active:

# lvs -o +tags | grep 4af22343-6df8-4c36-adc9-bca01e423e39
  a5bea3ee-9f07-4253-bde2-106d15309feb 43ce579b-57d0-4201-91b9-f2436f925a50 -wi------ 1.00g PU_00000000-0000-0000-0000-000000000000,IU_4af22343-6df8-4c36-adc9-bca01e423e39,MD_22
  b63b475f-c08c-4be9-a577-2c6e262c9afe 43ce579b-57d0-4201-91b9-f2436f925a50 -wi-a---- 1.00g MD_25,PU_be749868-5666-4751-b449-fecc44eb228c,IU_4af22343-6df8-4c36-adc9-bca01e423e39
  be749868-5666-4751-b449-fecc44eb228c 43ce579b-57d0-4201-91b9-f2436f925a50 -wi-a---- 1.00g MD_24,IU_4af22343-6df8-4c36-adc9-bca01e423e39,PU_d2cae84a-8f59-414c-9040-d4df7b9141ad
  d2cae84a-8f59-414c-9040-d4df7b9141ad 43ce579b-57d0-4201-91b9-f2436f925a50 -wi-a---- 1.00g PU_a5bea3ee-9f07-4253-bde2-106d15309feb,IU_4af22343-6df8-4c36-adc9-bca01e423e39,MD_23

even though:

# ls -l $(readlink -f /dev/43ce579b-57d0-4201-91b9-f2436f925a50/a5bea3ee-9f07-4253-bde2-106d15309feb)
brw-rw----. 1 vdsm qemu 253, 41 Feb 3 17:34 /dev/dm-41

Trying to activate it again fails with:

# lvchange -ay 43ce579b-57d0-4201-91b9-f2436f925a50/a5bea3ee-9f07-4253-bde2-106d15309feb
  device-mapper: create ioctl on 43ce579b--57d0--4201--91b9--f2436f925a50-a5bea3ee--9f07--4253--bde2--106d15309feb failed: Device or resource busy

I'm not sure whether this weird LVM status is the cause or one of the effects; in any case, this doesn't seem related to VDSM. Eric, any idea?
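A sketch of the broken lookup described above, using the paths from these logs (shell only, not the actual libvirt code; the exact dm-N node varies per host):

# The /rhev/... image path is a symlink chain that canonicalizes down to a /dev/dm-N node:
readlink -f /rhev/data-center/b08d859c-f654-4976-8c7c-e102d4c252fb/43ce579b-57d0-4201-91b9-f2436f925a50/images/4af22343-6df8-4c36-adc9-bca01e423e39/b63b475f-c08c-4be9-a577-2c6e262c9afe
# Canonicalizing first leaves /dev as the base directory, so the relative backing
# name from the qcow2 header ends up being resolved as
#   /dev/../4af22343-6df8-4c36-adc9-bca01e423e39/d2cae84a-8f59-414c-9040-d4df7b9141ad
# which does not exist -- matching the absolutePathFromBaseFile error above.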
Definitely an unintentional regression in 6.4, introduced as part of my refactoring into virStorageFileGetMetadataRecurse. The code sees that the block name is a symlink, does a complete canonicalize_file_name() down to /dev/dm-46, and then tries to treat the backing file as relative to the dirname of the canonical name; but /dev/../uuid doesn't exist. Rather, libvirt should follow the ../uuid backing file name relative to the dirname that existed prior to resolving the symlink of the basename. I'm working on a patch, but ran out of time today. Should we reassign or clone to libvirt?
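A sketch of the intended lookup (shell only, not the libvirt patch itself): keep the directory of the path as it was given, append the relative backing name, and only then canonicalize:

# Resolve ../<image-group>/<parent-volume> against the original /rhev/... directory,
# not against the canonicalized /dev location:
readlink -f /rhev/data-center/b08d859c-f654-4976-8c7c-e102d4c252fb/43ce579b-57d0-4201-91b9-f2436f925a50/images/4af22343-6df8-4c36-adc9-bca01e423e39/../4af22343-6df8-4c36-adc9-bca01e423e39/d2cae84a-8f59-414c-9040-d4df7b9141ad
# This follows the d2cae84a-... symlink shown in the directory listing above to its
# /dev/<vg>/<lv> target (and ultimately its dm node), so the backing volume is found.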
Upstream patch series proposed: https://www.redhat.com/archives/libvir-list/2013-February/msg00340.html
*** Bug 896507 has been marked as a duplicate of this bug. ***
Fixed upstream as of v1.0.2-177-ga18452d with:

commit a18452d0d2aedf65b83b24983ce5746cfe68e71a
Author: Eric Blake <eblake@redhat.com>
Date:   Wed Feb 13 11:04:05 2013 -0700

    storage: test backing chain traversal

    Testing our backing chain handling will make it much easier to ensure
    that we avoid issues in the future. If only I had written this test
    before I first caused several regressions...

    * tests/virstoragetest.c: New test.
    * tests/Makefile.am (test_programs): Build it.
    * .gitignore: Ignore new files.

commit d1333dd0fbb944a7d5fe56180bcf6b0fd4f85e18
Author: Eric Blake <eblake@redhat.com>
Date:   Wed Feb 6 15:48:15 2013 -0700

    storage: don't follow backing chain symlinks too eagerly

    If you have a qcow2 file /path1/to/file pointed to by symlink
    /path2/symlink, and pass qemu /path2/symlink, then qemu treats a
    relative backing file in the qcow2 metadata as being relative to
    /path2, not /path1/to. Yes, this means that it is possible to create
    a qcow2 file where the choice of WHICH directory and symlink you
    access its contents from will then determine WHICH backing file (if
    any) you actually find; the results can be rather screwy, but we have
    to match what qemu does.

    Libvirt and qemu default to creating absolute backing file names, so
    most users don't hit this. But at least VDSM uses symlinks and
    relative backing names alongside the --reuse-external flags to
    libvirt snapshot operations, with the result that libvirt was failing
    to follow the intended chain of backing files, and then backing files
    were not granted the necessary sVirt permissions to be opened by
    qemu. See https://bugzilla.redhat.com/show_bug.cgi?id=903248 for more
    gory details. This fixes a regression introduced in commit 8250783.

    I tested this patch by creating the following chain:

    ls /home/eblake/Downloads/Fedora.iso # raw file for base
    cd /var/lib/libvirt/images
    qemu-img create -f qcow2 \
      -obacking_file=/home/eblake/Downloads/Fedora.iso,backing_fmt=raw one
    mkdir sub
    cd sub
    ln -s ../one onelink
    qemu-img create -f qcow2 \
      -obacking_file=../sub/onelink,backing_fmt=qcow2 two
    mv two ..
    ln -s ../two twolink
    qemu-img create -f qcow2 \
      -obacking_file=../sub/twolink,backing_fmt=qcow2 three
    mv three ..
    ln -s ../three threelink

    then pointing my domain at /var/lib/libvirt/images/sub/threelink.
    Prior to this patch, I got complaints about missing backing files;
    afterwards, I was able to verify that the backing chain (and hence
    DAC and SELinux relabels) of the entire chain worked.

    * src/util/virstoragefile.h (_virStorageFileMetadata): Add directory
    member.
    * src/util/virstoragefile.c (absolutePathFromBaseFile): Drop,
    replaced by...
    (virFindBackingFile): ...better function.
    (virStorageFileGetMetadataInternal): Add an argument.
    (virStorageFileGetMetadataFromFD, virStorageFileChainLookup)
    (virStorageFileGetMetadata): Update callers.

commit 2485f921536659181c9afa30d4fc1ee99dc064ae
Author: Eric Blake <eblake@redhat.com>
Date:   Wed Feb 6 16:33:45 2013 -0700

    storage: refactor metadata lookup

    Prior to this patch, we had the callchains:

    external users
     \-> virStorageFileGetMetadataFromFD
         \-> virStorageFileGetMetadataFromBuf
    virStorageFileGetMetadataRecurse
     \-> virStorageFileGetMetadataFromFD
         \-> virStorageFileGetMetadataFromBuf

    However, a future patch wants to add an additional parameter to the
    bottom of the chain, for use by virStorageFileGetMetadataRecurse,
    without affecting existing external callers. Since there is only a
    single caller of the internal function, we can repurpose it to fit
    our needs, with this patch giving us:

    external users
     \-> virStorageFileGetMetadataFromFD
         \-> virStorageFileGetMetadataInternal
    virStorageFileGetMetadataRecurse    /
     \-> virStorageFileGetMetadataInternal

    * src/util/virstoragefile.c (virStorageFileGetMetadataFromFD): Move
    most of the guts...
    (virStorageFileGetMetadataFromBuf): ...here, and rename...
    (virStorageFileGetMetadataInternal): ...to this.
    (virStorageFileGetMetadataRecurse): Use internal helper.

commit b7df4f92d6a62f4ec7513af11590f77152e0a19b
Author: Eric Blake <eblake@redhat.com>
Date:   Sat Feb 9 06:41:01 2013 -0700

    storage: prepare for refactoring

    virStorageFileGetMetadataFromFD is the only caller of
    virStorageFileGetMetadataFromBuf; and it doesn't care about the
    difference between a return of 0 (total success) or 1 (metadata was
    inconsistent, but pointer was populated as best as possible); only
    about a return of -1 (could not read metadata or out of memory).
    Changing the return type, and normalizing the variable names used,
    will make merging the functions easier in the next commit.

    * src/util/virstoragefile.c (virStorageFileGetMetadataFromBuf):
    Change return value, and rename some variables.
    (virStorageFileGetMetadataFromFD): Rename some variables.

commit 5e4946d4d98badb9982ff78d795e1bbd9829ccde
Author: Eric Blake <eblake@redhat.com>
Date:   Wed Feb 6 16:54:08 2013 -0700

    storage: rearrange functions

    No semantic change; done so the next patch doesn't need a forward
    declaration of a static function.

    * src/util/virstoragefile.c (virStorageFileProbeFormatFromBuf): Hoist
    earlier.

Moving to POST.
I can reproduce this bug with libvirt-0.10.2-18.el6.x86_64:

1) Need a RHEVM env and iSCSI storage.
2) Start a VM.
3) Create 3 live snapshots via RHEVM.
4) kill -9 the VM's pid.
5) Try to start the VM.

libvirtd log error:

error : qemuProcessWaitForMonitor:1752 : internal error process exited while connecting to monitor: qemu-kvm: -drive file=/rhev/data-center/16550381-93bb-4c0a-b6ba-c16b73868978/9f38a056-aa10-4f2e-a58e-7ad7525ce6a2/images/f4747e80-33a4-45ab-b994-18d453c8b976/eb52e017-4d0c-4cb4-ae8c-4e948f4c5fde,if=none,id=drive-virtio-disk0,format=qcow2,serial=f4747e80-33a4-45ab-b994-18d453c8b976,cache=none,werror=stop,rerror=stop,aio=native: could not open disk image /rhev/data-center/16550381-93bb-4c0a-b6ba-c16b73868978/9f38a056-aa10-4f2e-a58e-7ad7525ce6a2/images/f4747e80-33a4-45ab-b994-18d453c8b976/eb52e017-4d0c-4cb4-ae8c-4e948f4c5fde: Operation not permitted

Wenlong
*** Bug 922419 has been marked as a duplicate of this bug. ***
Verified this bug with libvirt-0.10.2-19.el6.x86_64:

1) Need a RHEVM env and iSCSI storage.
2) Start a VM.
3) Create 3 live snapshots via RHEVM.
4) kill -9 the VM's pid.
5) Try to start the VM.
6) The VM starts successfully, without the libvirtd error seen before the fix, i.e. no more:

error : qemuProcessWaitForMonitor:1752 : internal error process exited while connecting to monitor: qemu-kvm: -drive file=/rhev/data-center/16550381-93bb-4c0a-b6ba-c16b73868978/9f38a056-aa10-4f2e-a58e-7ad7525ce6a2/images/f4747e80-33a4-45ab-b994-18d453c8b976/eb52e017-4d0c-4cb4-ae8c-4e948f4c5fde,if=none,id=drive-virtio-disk0,format=qcow2,serial=f4747e80-33a4-45ab-b994-18d453c8b976,cache=none,werror=stop,rerror=stop,aio=native: could not open disk image /rhev/data-center/16550381-93bb-4c0a-b6ba-c16b73868978/9f38a056-aa10-4f2e-a58e-7ad7525ce6a2/images/f4747e80-33a4-45ab-b994-18d453c8b976/eb52e017-4d0c-4cb4-ae8c-4e948f4c5fde: Operation not permitted

So this bug is verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1581.html