920020 – qemu-img delete snapshot causes corruption under high IO load

Bug 920020 - qemu-img delete snapshot causes corruption under high IO load

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	qemu-kvm
Sub Component:
Version:	6.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Kevin Wolf
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	810856

Reported:	2013-03-11 07:47 UTC by Roman Hodain
Modified:	2018-11-30 19:24 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-03-20 11:31:45 UTC
Target Upstream Version:
Embargoed:

Description Roman Hodain 2013-03-11 07:47:14 UTC

Description of problem:
When an internal snapshot of a virtual machine under high I/O load is created and then removed, the qcow2 image becomes corrupted.

Version-Release number of selected component (if applicable):
   qemu-kvm-0.12.1.2-2.355.el6_4.1.x86_64

How reproducible:
   Always with high IO load

Steps to Reproduce:
1. Create a VM with qcow2 disk format
2. Install RHEL on that VM
3. Generate I/O load
   example:
     for i in `seq 1 20`; do
       dd if=/dev/zero of=/dev/VolGroup/test &
       dd if=/dev/VolGroup/test of=/dev/null &
     done

4. Run the following script on the host system:

    #!/bin/bash -x
    SNAPDATE=`date +%d-%m-%Y`
    TS=`date +%d%m%y-%H%m%S`
    i=/var/lib/libvirt/images/test.img
    qemu-img snapshot -c "$SNAPDATE" $i
    qemu-img convert -f qcow2 -O qcow2 -s "$SNAPDATE" $i $i-snapshot
    virsh suspend test
    qemu-img snapshot -d "$SNAPDATE" $i
    virsh resume test
  
Actual results:
qemu-img check returns errors:
   # qemu-img check /var/lib/libvirt/images/test.img 
      ERROR OFLAG_COPIED: l2_offset=8000000000040000 refcount=2
      ERROR OFLAG_COPIED: offset=8000000000050000 refcount=2
      ERROR OFLAG_COPIED: offset=8000000000060000 refcount=2
      ...
      Leaked cluster 4 refcount=2 reference=1
      Leaked cluster 5 refcount=2 reference=1
      ...
      28220 errors were found on the image.
      Data may be corrupted, or further writes to the image may corrupt it.

      28222 leaked clusters were found on the image.
      This means waste of disk space, but no harm to data.



Expected results:
No errors detected

Additional info:

Comment 5 Kevin Wolf 2013-03-20 11:31:45 UTC

(In reply to comment #0)
> 4. proceed with the following script on the host system:
> 
>     #!/bin/bash -x
>     SNAPDATE=`date +%d-%m-%Y`
>     TS=`date +%d%m%y-%H%m%S`
>     i=/var/lib/libvirt/images/test.img
>     qemu-img snapshot -c "$SNAPDATE" $i
>     qemu-img convert -f qcow2 -O qcow2 -s "$SNAPDATE" $i $i-snapshot
>     virsh suspend test
>     qemu-img snapshot -d "$SNAPDATE" $i
>     virsh resume test

qcow2 images must not be used in read-write mode from two processes at the same
time. They can be opened either by one read-write process or by
many read-only processes. Having one (paused) read-write process (the running
VM) and additional read-only processes (copying out a snapshot with qemu-img)
may happen to work in practice, but you're on your own and we won't give
support for such attempts.
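
One way to act on this rule in a script is to refuse any read-write qemu-img
operation while the guest could still have the image open. The following is only
a minimal sketch: it assumes the libvirt domain is the only other user of the
image and that `virsh domstate` prints `shut off` for a stopped guest in the C
locale. The function name `delete_snapshot_offline` is hypothetical.

```shell
# Hypothetical helper: delete an internal snapshot only while the guest is
# shut off, so qemu-img is the only process holding the image read-write.
delete_snapshot_offline() {
    dom=$1 img=$2 snap=$3
    # Force the C locale so the state string is predictable (assumption).
    state=$(LC_ALL=C virsh domstate "$dom") || return 1
    if [ "$state" != "shut off" ]; then
        # A paused guest still has the image open read-write, so refuse too.
        echo "refusing: domain '$dom' is '$state', image may be in use" >&2
        return 1
    fi
    qemu-img snapshot -d "$snap" "$img"
}
```

Called as `delete_snapshot_offline test /var/lib/libvirt/images/test.img "$SNAPDATE"`,
it returns non-zero without touching the image unless the guest is shut off.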

Additionally, internal snapshots are not supported in RHEL either. If you can
do without support, this is how it _should_ work on an upstream qemu:

  1. Pause the VM
  2. Take an internal snapshot with the 'savevm' command of the qemu monitor
     of the running VM, not with an external qemu-img process. virsh may or may
     not provide an interface for this.
  3. You can resume the VM now
  4. qemu-img convert -f qcow2 -O qcow2 -s "$SNAPDATE" $i $i-snapshot
  5. Pause the VM again
  6. 'delvm' in the qemu monitor
  7. Resume the VM
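
The seven steps above could be scripted roughly as follows. This is a sketch
under assumptions, not a supported procedure: it assumes an upstream qemu-img
that accepts `-s`, and a libvirt whose `virsh qemu-monitor-command --hmp` passes
`savevm` and `delvm` through to the qemu monitor. The names `run` and
`export_internal_snapshot` and the `DRY_RUN` variable are hypothetical.

```shell
# Print each command; execute it unless DRY_RUN is set (hypothetical switch).
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

# Hypothetical wrapper: dom = libvirt domain, img = qcow2 image path,
# snap = snapshot name. The chain stops at the first failing step.
export_internal_snapshot() {
    dom=$1 img=$2 snap=$3
    run virsh suspend "$dom" &&                                   # 1. pause
    run virsh qemu-monitor-command --hmp "$dom" "savevm $snap" && # 2. snapshot from inside the VM process
    run virsh resume "$dom" &&                                    # 3. resume
    run qemu-img convert -f qcow2 -O qcow2 -s "$snap" "$img" "$img-snapshot" && # 4. copy the snapshot out
    run virsh suspend "$dom" &&                                   # 5. pause again
    run virsh qemu-monitor-command --hmp "$dom" "delvm $snap" &&  # 6. delete from inside the VM process
    run virsh resume "$dom"                                       # 7. resume
}
```

Running it with `DRY_RUN=1` set only prints the commands, which makes it easy to
review the sequence first. Because the chain stops at the first failure, a
failing step 6 leaves the guest paused until it is resumed by hand.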

Note that I said in upstream qemu. This is because RHEL's qemu-img doesn't even
have the -s option. This also means that your customer can't possibly have used
the RHEL 6 version of qemu-img if his script didn't give him errors.

So to summarize, we have three reasons why we can't accept this bug:

- Opening a qcow2 image r/w from two processes is always wrong and corruption
  is expected in such scenarios
- Internal snapshots are unsupported on RHEL
- The customer obviously didn't use RHEL binaries

Closing as NOTABUG for the first and third one. The second one would make it
WONTFIX, if in doubt.
