Bug 1067576 - RHEV: Cannot start VMs that have more than 23 snapshots.
Summary: RHEV: Cannot start VMs that have more than 23 snapshots.
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: qemu-kvm
Version: 6.5
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 6.5
Assignee: Jeff Cody
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Keywords: ZStream
Duplicates: 1071023
Depends On:
Blocks: 1023565 1071023 1071740 1072302 1072339 1113583
Reported: 2014-02-20 16:38 UTC by Gordon Watson
Modified: 2019-04-28 08:39 UTC
CC List: 27 users

Doc Text:
Previously, the number of characters in the file name strings for virtual machine (VM) images was limited. Because each new VM image snapshot lengthens the file name string that has to be resolved, once the limit was reached either creating an image snapshot failed or the VM failed to boot. This update fixes the handling of file names so that long file names are now supported, and these problems no longer occur.
Clone Of:
Clones: 1071023 1072339 1113583
Last Closed: 2014-10-14 06:55:57 UTC


Attachments
Script to reproduce relative pathname bug with just qemu-kvm and qemu-img (2.38 KB, text/plain)
2014-02-25 18:07 UTC, Jeff Cody
no flags


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2014:1490 | normal | SHIPPED_LIVE | qemu-kvm bug fix and enhancement update | 2014-10-14 01:28:27 UTC
Red Hat Knowledge Base (Article) | 730773 | None | None | None | Never

Description Gordon Watson 2014-02-20 16:38:07 UTC
Description of problem:


Version-Release number of selected component (if applicable):

RHEV 3.2, 3.3
vdsm-4.13.2-0.10
libvirt-0.10.2-29.el6_5.3
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.3


How reproducible:

The problem specifically described here requires a VM with more than 24 snapshots in order to reproduce. However, the same condition can be reproduced by creating a VM and adding snapshots until the volume clone sequence fails (see the steps below).


Steps to Reproduce:
1.  Create a new VM with a single 1 GB disk.
2.  Keep creating snapshots.
3.  Eventually you should encounter "VolumeCreationError: Error creating a new volume".


Actual results:

Can no longer start a VM that has more than 24 snapshots.
Cannot create more than 24 snapshots.


Expected results:

Starting the VM and creating further snapshots both succeed.


Additional info:

See subsequent bug comments for details.

Comment 8 Gordon Watson 2014-02-20 22:23:16 UTC
Problem Description:

Somehow I omitted the problem description in the initial comment. Sorry about that. So, to make up for that, here it is.

After upgrading a RHEV host to 'vdsm-4.13.2-0.x' and 'qemu-kvm-rhev-0.12.1.2-2.415*', a VM with more than 24 snapshots will fail to start. In the RHEV Admin Portal the customer sees "Unable to read from monitor: Connection reset by peer". This is also reported in the vdsm logs on the host on which the VM tried to run. The qemu log for the VM shows "could not open disk image", "No such file or directory" for the current active image.

While troubleshooting this I performed some tests. I created a new VM with a single disk and started to create snapshots. The 24th snapshot failed to be created. The 'qemu-img create' failed with "could not open disk image", "No such file or directory".

Comment 11 Qunfang Zhang 2014-02-21 09:40:44 UTC
My first attempt to reproduce the bug failed.  I tested qemu-kvm-rhev-0.12.1.2-2.415.el6_5.4.x86_64 with both live snapshot creation and offline snapshot creation, and neither reproduced the bug.


CLI:
 /usr/libexec/qemu-kvm -cpu SandyBridge -M rhel6.5.0 -enable-kvm -m 2048 -smp 2,sockets=2,cores=1,threads=1 -name rhel6.4-64 -uuid 9a0e67ec-f286-d8e7-0548-0c1c9ec93009 -nodefconfig -nodefaults -monitor stdio -rtc base=utc,clock=host,driftfix=slew -no-kvm-pit-reinjection -no-shutdown  -drive file=sn33,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:d5:51:8a,bus=pci.0,addr=0x3 -chardev socket,id=charserial0,server,nowait,path=/tmp/isa-serial -device isa-serial,chardev=charserial0,id=serial0  -vnc :11 -vga std  -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -qmp tcp:0:5566,server,nowait -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0

(1) Live snapshot creation:

(qemu) info block
drive-virtio-disk0: removable=0 io-status=ok file=sn33 backing_file=sn32 ro=0 drv=qcow2 encrypted=0 bps=0 bps_rd=0 bps_wr=0 iops=0 iops_rd=0 iops_wr=0
drive-ide0-1-0: removable=1 locked=0 tray-open=0 io-status=ok [not inserted]
(qemu) 
(qemu) snapshot_blkdev drive-virtio-disk0 sn34

Kept creating snapshots, more than 30 in total; all succeeded.  Shut down the guest and booted again from the sn34 image; this also succeeded.

(2) Offline snapshot creation:

#qemu-img create -f qcow2 -F qcow2 -b RHEL-Server-6.5-64-virtio.qcow2 sn1

Repeated until the 30th snapshot was created; all succeeded. Booting up the 30th snapshot also worked without failure.

Comment 12 Kevin Wolf 2014-02-21 09:52:07 UTC
Can you try upgrading only one package (vdsm or qemu-kvm, but not both) in order
to find out which one causes the problem?

Your above comments suggest that qemu-img check can reproduce the problem, so
if the problem was in qemu, the following should have reproduced it to my
understanding:

$ qemu-img create -f qcow2 /tmp/sn1.qcow2 4G
Formatting '/tmp/sn1.qcow2', fmt=qcow2 size=4294967296 encryption=off cluster_size=65536 
$ for i in $(seq 1 32); do qemu-img create -f qcow2 -b /tmp/sn$i.qcow2 /tmp/sn$((i+1)).qcow2; done
Formatting '/tmp/sn2.qcow2', fmt=qcow2 size=4294967296 backing_file='/tmp/sn1.qcow2' encryption=off cluster_size=65536 
Formatting '/tmp/sn3.qcow2', fmt=qcow2 size=4294967296 backing_file='/tmp/sn2.qcow2' encryption=off cluster_size=65536 
[...]
Formatting '/tmp/sn33.qcow2', fmt=qcow2 size=4294967296 backing_file='/tmp/sn32.qcow2' encryption=off cluster_size=65536 
$ qemu-img check -f qcow2 /tmp/sn33.qcow2
No errors were found on the image.
Image end offset: 262144

So this case works for me. I also tried replacing sn1.qcow2 with a real image
and it booted up okay. My suspicion at this moment is that VDSM didn't prepare
the LVs correctly, so that some of these UUID monsters can't be opened (either
because they don't exist or permissions are missing).

Another thing you could try is qemu-img info --backing-chain $TOP_LEVEL_IMAGE

Comment 13 Jeff Cody 2014-02-21 13:19:48 UTC
I just tried reproducing on -415 using live snapshots, with QMP commands:

{ "execute": "qmp_capabilities" }
{"return": {}}

{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "virtio0","snapshot-file":"/tmp/snap2.qcow2","format": "qcow2" } }
Formatting '/tmp/snap2.qcow2', fmt=qcow2 size=272730423296 backing_file='/tmp/snap1.qcow2' backing_fmt='qcow2' encryption=off cluster_size=65536 
{"return": {}}

...

{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "virtio0","snapshot-file":"/tmp/snap33.qcow2","format": "qcow2" } }
Formatting '/tmp/snap33.qcow2', fmt=qcow2 size=272730423296 backing_file='/tmp/snap32.qcow2' backing_fmt='qcow2' encryption=off cluster_size=65536 
{"return": {}}

So I was not able to reproduce it with just qemu-kvm and live snapshots either.  Next step is probably to try with virsh / libvirt.

Comment 14 Jeff Cody 2014-02-21 14:13:02 UTC
Tried reproducing with virsh as well:

for i in $(seq 1 32); do virsh snapshot-create-as f16 f16-sn${i} --diskspec vda,file=/var/lib/libvirt/images/f16-sn${i}.qcow2 --disk-only --atomic; done

And it worked fine: 

Domain snapshot f16-sn1 created

...

Domain snapshot f16-sn32 created


virsh snapshot-list also looks correct.

Comment 16 Jeff Cody 2014-02-21 14:23:26 UTC
Federico,

Could you see if you can reproduce this with vdsm?

Comment 19 Kevin Wolf 2014-02-21 18:33:14 UTC
Gordon gave me access to a failing setup, so I could have a look there. After
activating all LVs in the backing file chain, what I saw is this:

# qemu-img info --backing-chain /rhev/data-center/b22eb742-7727-459d-b941-12f526310878/1d9d1e9d-5b6c-4dcc-9db6-c994cb65b135/images/e57eb496-d39b-4789-b36a-8ff4667a2137/7a1c63be-d354-432f-be34-539368be4d7f
Could not open '/rhev/data-center/b22eb742-7727-459d-b941-12f526310878/1d9d1e9d-5b6c-4dcc-9db6-c994cb65b135/images/e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/e515a76': No such file or directory

This means that while resolving the backing file path relative to the parent
image, the path became longer and longer (adding one ../$DIR/ instance per
backing file) until it reached a limit of 1023 characters, where it was
truncated.
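As a rough arithmetic check of this explanation, the C sketch below takes the path components from the failing setup above and assumes that every backing level adds one `../e57eb496-d39b-4789-b36a-8ff4667a2137/` segment (40 characters) after the fixed 136-character `/rhev/data-center/.../images/<image UUID>/` prefix. Under that assumption the truncated string in the error message is exactly 136 + 22 × 40 + 7 = 1023 characters, that is, a 1024-byte buffer minus its terminating NUL. The helper names `combined_len()` and `first_truncated_depth()` are hypothetical and are not QEMU functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fixed part of the path, copied from the failing setup (136 characters). */
static const char PREFIX[] =
    "/rhev/data-center/b22eb742-7727-459d-b941-12f526310878/"
    "1d9d1e9d-5b6c-4dcc-9db6-c994cb65b135/images/"
    "e57eb496-d39b-4789-b36a-8ff4667a2137/";

/* Assumption: each backing level adds one "../<image UUID>/" (40 characters). */
static const char SEGMENT[] = "../e57eb496-d39b-4789-b36a-8ff4667a2137/";

enum { UUID_LEN = 36 };  /* length of a full volume UUID used as a file name */

/* Length of the combined path for a file `depth` levels below the top image. */
static size_t combined_len(unsigned depth, size_t leaf_len)
{
    return strlen(PREFIX) + (size_t)depth * strlen(SEGMENT) + leaf_len;
}

/* First depth whose full path (with a 36-character leaf) exceeds max_chars. */
static unsigned first_truncated_depth(size_t max_chars)
{
    unsigned depth = 0;

    while (combined_len(depth, UUID_LEN) <= max_chars)
        depth++;
    return depth;
}
```

With a 1023-character budget the first depth that no longer fits is 22; the same loop with a 4095-character budget (a 4096-byte PATH_MAX buffer) gives 99. The exact snapshot count at which a given setup fails depends on its directory names; the point is only that the path grows linearly with chain depth.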


qemu-img check fails while trying to open this file (as seen with strace):

/rhev/data-center/b22eb742-7727-459d-b941-12f526310878/1d9d1e9d-5b6c-4dcc-9db6-c994cb65b135/images/e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../e57eb496-d39b-4789-b36a-8ff4667a2137/../../e57eb496-d39b-4789-b36a-8ff4667a2137/2db2a047-bb03-4a51-87ee-b27baacac502

Note the ../../ part before the last element. It is at the same place in the
filename string as the truncated element above for qemu-img info
--backing-chain, but it's clearly looking a bit different.

Comment 20 Jeff Cody 2014-02-21 19:02:58 UTC
(In reply to Kevin Wolf from comment #19)

Excellent, thanks.  I am now able to reproduce the bug using qemu-img (works on -355, doesn't work on -415) with redundant relative paths longer than 1024 characters.

Comment 21 Jeff Cody 2014-02-22 00:56:57 UTC
I have found 3 commits since -355 that cause image creation problems when a relative pathname + filename exceeds 1024 characters.

The first is commit 387ea7cb8345bdcc900ffe145cbcdcf65da91257:
qemu-img: make "info" backing file output correct and easier to use

This created a new function, bdrv_get_full_backing_filename(), to generate the backing filename with correct pathing.  Rather than use the passed filename string, it used bs->filename, which is truncated to 1024 bytes / PATH_MAX.

The second is commit 4fa7281349ab24ca9a2ae226cf6881fa383b4dda:
qcow2: Flush image after creation

For a qcow2 image, at the end of the image creation, the newly created image has its backing file set.  This commit then closed the image, and opened it again, to clear the BDRV_O_NO_FLUSH internal block flag.  This bdrv_open() failed, because the backing filename was truncated again to 1024 / PATH_MAX.


For testing, I've created 2 patches that keep the same functionality but restore the -355 behavior.  The first patch allows passing the filename to bdrv_get_full_backing_filename(), which changes 387ea7cb back to the old behavior.

The second patch uses bdrv_reopen() to clear the BDRV_O_NO_FLUSH flag, which prevents the closing and opening of the image introduced in 4fa7281.

HOWEVER, this leaves the 3rd problem commit, a fix for which is not so straightforward.  While the previous two fixes allow the snapshot creation, the backing file string in the qcow2 header of the snapshot image is truncated to 1024.  This obviously creates an invalid image file.

This behavior was introduced with
commit 36335a65701ce25ba30b4b4610584b6c872b9cbe
qcow2: Simplify image creation

Prior to this, we would directly write the passed backing_file string into the newly created qcow2 image file.

After this commit, the qcow2 image file creation now uses the common block functions, and calls bdrv_change_backing_file() to write the backing file string to the qcow2 header.  This runs the backing_file string, which is > 1024 bytes, through the BlockDriverState structures which truncates the string to 1024 bytes.

A short-term fix for this could be to make the BDS filename and backing_file strings dynamically allocated.  We could then work on a longer-term fix.

And incidentally, with a dynamically allocated string, the previous 2 commits mentioned should not be an issue (although I still think the proper method for clearing the NO_FLUSH flag in 4fa7281 is via bdrv_reopen, not bdrv_close + bdrv_open).

The real problem is the limitation of 1024 bytes for BDS filename and backing_file strings; we were just lucky in -355 that it worked, because we bypassed our own structures in just the right places.
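To illustrate the dynamic-allocation idea, here is a minimal C sketch of a path-combining helper that sizes its result with malloc() instead of writing into a fixed 1024-byte or PATH_MAX buffer. `combine_path_alloc()` is a hypothetical name, and the relative-path rule (keep the parent image's directory, append the backing name) is a simplified assumption that ignores protocol prefixes; it is not QEMU's implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocating variant: resolve `filename` relative to the
 * directory of `base_path` into a buffer sized to fit, so nothing is
 * truncated. Returns NULL on allocation failure; the caller frees. */
static char *combine_path_alloc(const char *base_path, const char *filename)
{
    size_t dir_len = 0, name_len = strlen(filename);
    const char *slash;
    char *out;

    if (filename[0] != '/') {               /* relative: keep base's dir */
        slash = strrchr(base_path, '/');
        if (slash)
            dir_len = (size_t)(slash - base_path) + 1;
    }
    out = malloc(dir_len + name_len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, base_path, dir_len);
    memcpy(out + dir_len, filename, name_len + 1);  /* copies the NUL too */
    return out;
}
```

This removes the truncation but not the growth itself: each level still appends another `../dir/` segment, which is why flattening redundant relative pathnames is also proposed in comment 26.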

Comment 22 Qunfang Zhang 2014-02-24 04:49:47 UTC
Reproduced the issue on qemu-kvm-rhev-0.12.1.2-2.415.el6_5.4.x86_64. As Kevin and Jeff described, snapshot creation fails when the backing file name is longer than 1024 characters.

[root@localhost e57eb496-d39b-4789-b36a-8ff4667a2137]# qemu-img create -f qcow2 -F qcow2 -b /home/qzhang/test/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/disk.qcow2 
/home/qzhang/test/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/sn1
Formatting '/home/qzhang/test/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/sn1', fmt=qcow2 size=1073741824 
backing_file='/home/qzhang/test/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/disk.qcow2' backing_fmt='qcow2' encryption=off cluster_size=65536 
/home/qzhang/test/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/e57eb496-d39b-4789-b36a-8ff4667a2137/sn1: error while creating qcow2: No such file or directory

Comment 23 Kevin Wolf 2014-02-24 09:29:08 UTC
(In reply to Jeff Cody from comment #21)
> HOWEVER, this leaves the 3rd problem commit, a fix for which is not so
> straightforward.  While the previous two fixes allow the snapshot creation,
> the backing file string in the qcow2 header of the snapshot image is
> truncated to 1024.  This obviously creates an invalid image file.
> 
> This behavior was introduced with
> commit 36335a65701ce25ba30b4b4610584b6c872b9cbe
> qcow2: Simplify image creation
> 
> Prior to this, we would directly write the passed backing_file string into
> the newly created qcow2 image file.
> 
> After this commit, the qcow2 image file creation now uses the common block
> functions, and calls bdrv_change_backing_file() to write the backing file
> string to the qcow2 header.  This runs the backing_file string, which is >
> 1024 bytes, through the BlockDriverState structures which truncates the
> string to 1024 bytes.

The image format has always been limiting the backing file length to 1023
characters in the spec. Writing a longer backing file name would have caused
qcow2_open() to truncate it.

The real question here is why you even get a long string. We should be writing
a short relative path like "../dir/image" to the qcow2 header. It's only when
you combine it with the path of the parent image that you get very long paths.

Comment 24 Jeff Cody 2014-02-24 12:39:34 UTC
(In reply to Kevin Wolf from comment #23)

I'm not convinced that -355 would have been able to successfully open the resulting image file described in the bug description (is there any confirmation of that, aside from being able or not able to create snapshots?).

However, -355 will create a qcow2 image file with a header that has a backing file name >= 1024 characters.  -415 will not.

Gordon, when you created the additional snapshots under -355, that were then subsequently not able to be opened by -415, were they able to be opened by -355?

And Kevin, I agree: something seems odd about the path creation, and that should be investigated as well; it is most likely at the vdsm level.

Comment 26 Jeff Cody 2014-02-25 00:08:20 UTC
The problem I described in comment #21 is slightly different from what actually appears to be happening.

Here is what appears to be happening: each snapshot is made with a relative pathname for the backing file.  The relative pathname references the parent directory and then the current directory, like so:

tstA
├── base.qcow2
├── sn1.qcow2   (backing file ../tstA/base.qcow2)
├── sn2.qcow2   (backing file ../tstA/sn1.qcow2)
└── sn3.qcow2   (backing file ../tstA/sn2.qcow2)

... etc.

So the backing file length, as stored in the qcow2 image file, does not exceed any length limitations in itself.

However, when the image file is opened, the paths are combined internally (via bdrv_get_full_backing_filename() and then path_combine()).  For my example above, this means we will try to open, for the base image:  "./../tstA/../tstA/../tstA/base.qcow2"

So as the number of snapshots increases, the length of the combined path used to open base.qcow2 (and the intermediate snapshots) increases as well.
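The growth described here can be modelled with a simplified, hypothetical stand-in for path_combine(): keep everything up to the last '/' of the parent's path, append the relative backing name, and silently truncate to the destination buffer the way a fixed-size char array would. `combine_path()` is an illustrative name; the sketch ignores protocol prefixes and Windows paths and is not QEMU's code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical model of path_combine(): resolve `filename`
 * relative to the directory of `base_path`, writing at most dest_size - 1
 * characters. dest may alias base_path, so the directory is memmove()d. */
static void combine_path(char *dest, size_t dest_size,
                         const char *base_path, const char *filename)
{
    size_t dir_len = 0, name_len, room;
    const char *slash;

    assert(dest_size > 0);
    if (filename[0] != '/') {               /* relative: keep base's dir */
        slash = strrchr(base_path, '/');
        if (slash)
            dir_len = (size_t)(slash - base_path) + 1;
    }
    if (dir_len > dest_size - 1)
        dir_len = dest_size - 1;
    memmove(dest, base_path, dir_len);

    /* Append the backing name, silently truncated like a fixed buffer. */
    room = dest_size - 1 - dir_len;
    name_len = strlen(filename);
    if (name_len > room)
        name_len = room;
    memcpy(dest + dir_len, filename, name_len);
    dest[dir_len + name_len] = '\0';
}
```

Starting from "./sn3.qcow2" and resolving "../tstA/sn2.qcow2", "../tstA/sn1.qcow2" and "../tstA/base.qcow2" in turn produces "./../tstA/../tstA/../tstA/base.qcow2"; with a small enough buffer the tail of the name is cut off instead, which is the truncation seen in comment 19.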

The length limitation of 'backing_file' and 'filename' in the BDS structure is 1024 bytes.  However, in bdrv_open() there is a temporary char array, 'backing_filename', that is 4096 bytes (more specifically, PATH_MAX under Linux).

Commit 387ea7cb, as described in comment #21, ended up limiting this internal combined string size to 1024 bytes.  Prior to 387ea7cb, the internal max would have been limited to 4096 bytes.

If this is the issue, then although -415 behaves differently from -355, we will still run into the same problem on -355, just at a greater snapshot depth.

Gordon, could you do 2 things please?
1.) Try -355 with ~100 snapshots instead of 24.  I am guessing you will hit the same issue by snapshot 96 or 97.

2.) Try the rpms from this brew build, to see if my test build (-415 plus 1 patch) behaves like -355:  https://brewweb.devel.redhat.com/taskinfo?taskID=7102063  (I placed them in the /tmp directory of the test machine, and they are currently installed.  Look for /tmp/*jtc.bz1067576.v1.x86_64.rpm if you need to reinstall them.)

Note that 'qemu-img info --backing-chain' did not exist in -355.  With this patch, it will still not work in -415.el6_5.jtc.bz1067576.v1, because the filename goes through the BDS filename string, which is limited to 1024 bytes.

If Gordon can confirm that this patched qemu-kvm version has the same behavior as -355, we should likely do 2 things as a real long term fix:
A) dynamically allocate filename and backing file strings, and paths
B) flatten out redundant relative pathnames as much as possible (or just resolve them internally to absolute pathnames)
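Item B could look roughly like the C sketch below, which lexically collapses "dir/.." pairs and "." components so that a chain like "./../tstA/../tstA/../tstA/base.qcow2" shrinks back to "../tstA/base.qcow2". `flatten_path()` is a hypothetical helper, not a QEMU function. A purely lexical rewrite assumes that no directory before a ".." is a symbolic link pointing elsewhere; where that cannot be guaranteed, resolving to an absolute path with POSIX realpath() (the other option named in item B) is the safer choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Lexically collapse "dir/.." pairs and "." components in a '/'-separated
 * path. Returns a heap string the caller frees, or NULL on allocation failure. */
static char *flatten_path(const char *path)
{
    size_t len, n = 0, i;
    int absolute;
    char *work, *out, *tok;
    char **parts;

    assert(path != NULL);
    len = strlen(path);
    absolute = (path[0] == '/');
    work = malloc(len + 1);
    out = malloc(len + 2);
    parts = malloc((len / 2 + 1) * sizeof *parts);  /* max component count */
    if (!work || !out || !parts) {
        free(work); free(out); free(parts);
        return NULL;
    }
    memcpy(work, path, len + 1);
    for (tok = strtok(work, "/"); tok; tok = strtok(NULL, "/")) {
        if (strcmp(tok, ".") == 0)
            continue;                           /* "./" adds nothing */
        if (strcmp(tok, "..") == 0) {
            if (n > 0 && strcmp(parts[n - 1], "..") != 0) {
                n--;                            /* "dir/.." cancels out */
                continue;
            }
            if (absolute)
                continue;                       /* "/.." is still "/" */
        }
        parts[n++] = tok;                       /* keep leading ".." if relative */
    }
    out[0] = '\0';
    if (absolute)
        strcat(out, "/");
    for (i = 0; i < n; i++) {
        if (i > 0)
            strcat(out, "/");
        strcat(out, parts[i]);
    }
    if (!absolute && n == 0)
        strcat(out, ".");
    free(work);
    free(parts);
    return out;
}
```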

Comment 27 Gordon Watson 2014-02-25 15:19:02 UTC
Jeff,

As we discussed, I installed the '415.el6_5.jtc.bz1067576.v1' rpms and was able to create snapshots, both live and offline, beyond the previous limit, was able to start the VM and after shutting it down, was able to delete a snapshot. So, it looks good.

Regarding creating ~100 snapshots, I will try to do this as a background task today, if possible. I may not be able to get that high, as I'm limited by the size of the storage domain that I'm using. Obviously I could extend it, but that will require extra resources and time, etc. 

Thanks very much, GFW.

Comment 28 Jeff Cody 2014-02-25 18:07:37 UTC
Created attachment 867573
Script to reproduce relative pathname bug with just qemu-kvm and qemu-img

Attached is a script to reproduce this bug with just qemu-kvm, and qemu-img.

This can be seen via either live snapshots, or image creation with qemu-img.

This script will do the following:

1) create test directory
2) create a qcow2 base image
3) launch qemu-kvm with qmp over localhost tcp
4) attempt to create 80 live snapshots, each with a relative backing pathname that goes up to the parent directory and then back into the test directory
5) run qemu-img to create a different set of 80 snapshots off the same base image
6) kill the process started in step #3

All you should need to edit in the script is the executable path for qemu and qemu-img.

The test directory is left after the script runs, with the created snapshots and an output log, for later examination.

Expected outcome:
-355:  All snapshots created successfully (live and via qemu-img)
-415:  First ~24 snapshots succeed, those after that fail

Any fix should restore it back to -355 parity.

If you edit the script to create 150 snapshots instead of 80, you will see -355 fail as well.

Comment 30 Jeff Cody 2014-02-28 18:22:32 UTC
*** Bug 1071023 has been marked as a duplicate of this bug. ***

Comment 33 Miroslav Rezanina 2014-03-04 07:48:26 UTC
Fix included in qemu-kvm-0.12.1.2-2.422.el6

Comment 36 Qunfang Zhang 2014-03-05 06:14:13 UTC
Hello, Jeff

I want to confirm with you what needs to be tested to verify this bug.  (1) Are the comment 22 + comment 28 scenarios enough?  (2) Do we need to run any functional tests for this bug? As you know, we have two rhel6.5-z bugs in hand. If any functional testing is needed, please tell us and we will arrange it for the rhel6.5-z errata.  (3) Is comment 29 still an existing problem?

Thanks,
Qunfang

Comment 37 Jeff Cody 2014-03-05 13:18:23 UTC
(In reply to Qunfang Zhang from comment #36)
> Hello, Jeff
> 
> I want to confirm with you what needs to be tested to verify this bug. (1)
> Are the scenarios in comment 22 and comment 28 enough? (2) Do we need to run
> any functional tests for this bug? As you know, we have two rhel6.5-z bugs in
> hand; if any functional testing is needed, please tell us and we will arrange
> it for the rhel6.5-z errata. (3) Is comment 29 still an existing problem?
> 
> Thanks,
> Qunfang

Hi Qunfang,

Comment 22 should not be used as a test: it tests something subtly different that is not part of the original bug. The actual bug is in the internal concatenation of relative paths, not in an externally supplied long path (that case is handled differently and will still fail).
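To make the "internal path concatenation" point concrete, here is a small model of how a path built from relative backing-file names can grow with chain depth. It is a hypothetical sketch, not qemu-kvm code: it assumes each overlay records its backing file as "../<dir>/snapshot-N.qcow2", that the opener resolves it as dirname(current path) + "/" + backing name without collapsing ".." components, and that the result must fit in a fixed buffer of LIMIT bytes (1024 is an assumed value). With these assumed names it stops at depth 24, in the same range as the ~24 snapshots reported in comment 28, but the exact number depends entirely on the assumed names and buffer size.

```shell
#!/bin/bash
# Hypothetical model (not qemu-kvm source) of naive relative-path combination.
dir="this-is-a-test-dir-for-snapshots-23698"   # directory name from the comment 44 log
limit="${LIMIT:-1024}"                          # assumed fixed buffer size in bytes

path="$dir/snapshot-top.qcow2"
depth=0
# Keep walking down the chain while the combined path still fits the buffer.
while (( ${#path} < limit )); do
    depth=$((depth + 1))
    backing="../$dir/snapshot-$depth.qcow2"
    # dirname of the current path + "/" + relative backing name;
    # ".." components are kept, so every level adds "/../$dir".
    path="${path%/*}/$backing"
done
echo "combined path reaches ${#path} bytes at depth $depth (limit $limit)"
```

Running it with LIMIT=4096 only moves the break further out (depth 97 with these names): in this model a larger buffer postpones the failure rather than removing it.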

For testing from QEMU's perspective alone, comment #28 / attachment #867573 should be used.

Jeff

Comment 38 Qunfang Zhang 2014-03-06 03:16:16 UTC
(In reply to Jeff Cody from comment #37)
> (In reply to Qunfang Zhang from comment #36)
> > Hello, Jeff
> > 
> > I want to confirm with you what needs to be tested to verify this bug. (1)
> > Are the scenarios in comment 22 and comment 28 enough? (2) Do we need to run
> > any functional tests for this bug? As you know, we have two rhel6.5-z bugs in
> > hand; if any functional testing is needed, please tell us and we will arrange
> > it for the rhel6.5-z errata. (3) Is comment 29 still an existing problem?
> > 
> > Thanks,
> > Qunfang
> 
> Hi Qunfang,
> 
> Comment 22 should not be used as a test: it tests something subtly
> different that is not part of the original bug. The actual bug is in the
> internal concatenation of relative paths, not in an externally supplied long
> path (that case is handled differently and will still fail).
> 
> For testing from QEMU's perspective alone, comment #28 / attachment
> #867573 should be used.
> 
> Jeff

Hi, Jeff

Thanks a lot for the feedback. We verified the two rhel6.5-z bugs, bug 1071740 and bug 1072302, with the steps and script from your comment 28, and both passed. Is that enough to verify the bug, or do we need to run additional functional tests?

Thanks,
Qunfang

Comment 39 Jeff Cody 2014-03-06 16:42:09 UTC
(In reply to Qunfang Zhang from comment #38)
> 
> Hi, Jeff
> 
> Thanks a lot for the feedback. We verified the two rhel6.5-z bugs, bug
> 1071740 and bug 1072302, with the steps and script from your comment 28, and
> both passed. Is that enough to verify the bug, or do we need to run
> additional functional tests?
> 
> Thanks,
> Qunfang

Hi Qunfang,

I think what you have done so far is technically all that is needed. However, since this is a customer-reported bug that was seen from the RHEV side, it may be worth verifying with the original reproduction method used by the reporter (Gordon). I do not know those steps, but Gordon was able to verify the fix, so perhaps he can provide reproduction/verification steps from RHEV if they are not too complicated.

Thanks,
Jeff

Comment 44 Shaolong Hu 2014-06-20 05:24:19 UTC
Verified with the script from https://bugzilla.redhat.com/attachment.cgi?id=867573:

[root@localhost ~]# ./sh
Formatting 'base.qcow2', fmt=qcow2 size=1073741824 encryption=off cluster_size=65536 
Performing live snapshot test
-----------------------------
Waiting to connect to qmp socket...
./sh: connect: Connection refused
./sh: line 44: /dev/tcp/localhost/44444: Connection refused
connected!
snapshot-0: success!
snapshot-1: success!
snapshot-2: success!
...
snapshot-76: success!
snapshot-77: success!
snapshot-78: success!
snapshot-79: success!

Performing qemu-img snapshot test
---------------------------------
snapshot-0: success!
snapshot-1: success!
snapshot-2: success!
snapshot-3: success!
...
snapshot-77: success!
snapshot-78: success!
snapshot-79: success!

Killing qemu process (23702)....
done, left directory this-is-a-test-dir-for-snapshots-23698 intact

Comment 45 Shaolong Hu 2014-06-20 05:24:52 UTC
verified on qemu-kvm-rhev-0.12.1.2-2.427.el6.x86_64

Comment 46 errata-xmlrpc 2014-10-14 06:55:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1490.html

