Bug 552487 - Guest image corruption after RHEV-H update to 5.4-2.1.3.el5_4rhev2_1 using virtio-blk
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.4.z
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: rc
Assignee: Kevin Wolf
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 560942 562776 562790 567940 577225
 
Reported: 2010-01-05 10:27 UTC by Dan Yasny
Modified: 2023-09-14 01:19 UTC
CC List: 11 users

Fixed In Version: kvm-83-155.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 560942 562790 567940 (view as bug list)
Environment:
Last Closed: 2010-03-30 07:53:58 UTC
Target Upstream Version:
Embargoed:


Attachments


Links:
System ID: Red Hat Product Errata RHSA-2010:0271 | Private: 0 | Priority: normal | Status: SHIPPED_LIVE | Summary: Important: kvm security, bug fix and enhancement update | Last Updated: 2010-03-29 13:19:48 UTC

Description Dan Yasny 2010-01-05 10:27:09 UTC
Description of problem:
The customer was given a temporary custom build containing the image corruption fix, which seemed to work. Then the update named in the summary (5.4-2.1.3.el5_4rhev2_1) was released and the customer installed it, since it was supposed to contain the same fixes he had been running temporarily.
After the update, the image corruption issue returned.

Version-Release number of selected component (if applicable):
5.4-2.1.3.el5_4rhev2_1

[root@bwyhs0018p vdsm]# rpm -qa |grep kvm
kmod-kvm-83-105.el5_4.13
etherboot-zroms-kvm-5.4.4-10.el5
kvm-debuginfo-83-105.el5_4.13
kvm-qemu-img-83-105.el5_4.13
kvm-83-105.el5_4.13
kvm-tools-83-105.el5_4.13

[root@bwyhs0018p vdsm]# rpm -qa |grep vdsm
vdsm-reg-4.4-37677
vdsm-cli-4.4-37677
vdsm-4.4-37677


How reproducible:
intermittent

Steps to Reproduce:
1. run several RHEL5 VMs using NFS storage
2. load the VMs with heavy disk and CPU load (the customer compiles gcc in a loop)
3. turn the NFS server off, then back on
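For reference, step 3 (the host-side storage outage) could be scripted roughly as follows. This is a hypothetical sketch, not from the report: the service name, timing, and the RUN dry-run wrapper are assumptions.

```shell
#!/bin/bash
# Inject the storage outage from step 3 of the reproducer. Assumes the
# VM images live on an NFS export served by this host's "nfs" service
# and that the loaded guests from steps 1-2 are already running.
# Set RUN=echo for a dry run that only prints the commands.

outage() {
    downtime="${1:-60}"          # seconds to keep the storage down
    $RUN service nfs stop        # yank the backing storage away
    $RUN sleep "$downtime"
    $RUN service nfs start       # restore it; guests should resume
}

# Dry run, printing the three commands for a one-minute outage:
RUN=echo outage 60
```

With werror=stop on the guests' data drives, the expectation from "Expected results" below is that the guests pause for the duration of the outage and continue once NFS is back.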
  
Actual results:
In the guest:

Jan  4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656639
Jan  4 13:13:15 gatest-c kernel: Aborting journal on device dm-0.
Jan  4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656640
Jan  4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656641
Jan  4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656642
Jan  4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656643
Jan  4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656644
Jan  4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656645
Jan  4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656646
Jan  4 13:13:15 gatest-c kernel: ext3_abort called.
Jan  4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
Jan  4 13:13:15 gatest-c kernel: Remounting filesystem read-only
Jan  4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656647
Jan  4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656648
Jan  4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656649

Expected results:
No corruption; the guest should pause and then continue once the storage comes back online.

Additional info:

Comment 1 Dor Laor 2010-01-05 10:48:02 UTC
It should have been fixed by bug 540406 - RHEL5.4 VM image corruption with an IDE v-disk.

Comment 2 Dor Laor 2010-01-05 10:48:52 UTC
A new bug, 550949, was also opened for a similar issue.

Comment 3 Kevin Wolf 2010-01-08 14:04:30 UTC
Is this even IDE or is it virtio?

As far as I can tell, the patches that worked in the scratch build are contained in kvm-83-105.el5_4.13. At least they are mentioned in the changelog.

Comment 6 Chris Van Hoof 2010-01-11 19:12:43 UTC
(In reply to comment #3)
> Is this even IDE or is it virtio?
> 
> As far as I can tell, the patches that worked in the scratch build are
> contained in kvm-83-105.el5_4.13. At least they are mentioned in the changelog.    

The recent failures using kvm-83-105.el5_4.13 were all VirtIO based.

Comment 7 Kevin Wolf 2010-01-12 14:55:20 UTC
I can't reproduce this with the binary in kvm-83-105.el5_4.13. With werror=stop the VM is paused when I stop the NFS server, and without it, Linux sees the I/O errors and remounts the file system read-only without corruption.

This whole report looks like the customer was running an unpatched version again. Are you 100% sure that the rpm -qa output is from the right machine and that this version is really running?

For reference, I tested with the binary from https://brewweb.devel.redhat.com/buildinfo?buildID=119205. The md5sum of qemu-kvm is bf464978c52cb4e99e8795644c940017.

Comment 8 Chris Van Hoof 2010-01-12 15:58:13 UTC
(In reply to comment #7)
> I can't reproduce this with the binary in the kvm-83-105.el5_4.13. With
> werror=stop the VM is paused when I stop the NFS server and without it, Linux
> sees the I/O errors and remounts the file system read-only without getting
> corruption.
> 
> This whole report looks like the customer was running an unpatched version
> again. Are you 100% sure that the rpm -qa output is from the right machine and
> this version is really running?
> 
> For reference, I tested with the binary from
> https://brewweb.devel.redhat.com/buildinfo?buildID=119205. The md5sum of
> qemu-kvm is bf464978c52cb4e99e8795644c940017.    

We've seen two more corruption cases today:

"""
All the nodes have
kmod-kvm-83-105.el5_4.13
etherboot-zroms-kvm-5.4.4-10.el5
kvm-debuginfo-83-105.el5_4.13
kvm-qemu-img-83-105.el5_4.13
kvm-83-105.el5_4.13
kvm-tools-83-105.el5_4.13

# md5sum /usr/libexec/qemu-kvm
bf464978c52cb4e99e8795644c940017  /usr/libexec/qemu-kvm
"""

What can we do next to further narrow this down? If you would like access to the image, or any other data, I can gather it fairly quickly.

--chris

Comment 9 Chris Van Hoof 2010-01-12 19:34:59 UTC
Additionally, of the three new instances of corruption we've seen, two occurred on a sparse RAW image and one on a sparse COW image.

Comment 10 Kevin Wolf 2010-01-13 09:20:41 UTC
Christoph, any idea? To me this looks very much like virtio-blk reading garbage instead of getting an error, i.e. something with the same effect as what was fixed for bug 531827; we just haven't reproduced it locally so far.

Comment 11 chellwig@redhat.com 2010-01-13 14:12:11 UTC
This very much looks like I/O errors from the filesystems.

First, are you really sure you are running the updated version and haven't just installed it? That is, did you restart all qemu instances after the upgrade?

If that's ruled out, I'd really love to see some more traces from inside qemu. Could you run an instrumented qemu binary so we can figure out more about this?

Comment 12 Dan Yasny 2010-01-13 14:29:31 UTC
Yes, the VMs were all restarted. In fact, the upgrade of the RHEV-H node reboots the node itself, so we are definitely running the correct versions.

We have also checked the md5sums of the binaries, and they match.
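For reference, a check of this kind might look as follows. This is a hypothetical sketch: the binary path matches the report, but the commands themselves are assumptions, not what the customer actually ran.

```shell
#!/bin/bash
# Verify that the qemu-kvm binary actually executing matches the
# installed rpm, not a stale pre-upgrade copy.

verify_qemu_binary() {
    bin="${1:-/usr/libexec/qemu-kvm}"

    # Checksum of the on-disk binary, to compare with the known-good sum
    md5sum "$bin"

    # Verify the file against the rpm database (size, digest, mtime, ...)
    rpm -Vf "$bin"

    # A process started before the upgrade keeps running the old code;
    # its /proc exe link then points at the old "(deleted)" binary.
    for pid in $(pgrep -f qemu-kvm); do
        ls -l "/proc/$pid/exe"
    done
}
```

The "(deleted)" check is what catches the "installed but not restarted" case Christoph asks about in comment 11.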

Comment 24 Chris Ward 2010-02-11 10:07:24 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! A fix that addresses your request should be present in this release. Please test and report the results back here by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 25 Miya Chen 2010-03-02 08:16:49 UTC
Tested virtio block in kvm-83-160.el5 with both raw and qcow2 images; the guest stops on read errors. (Tried 5 times for each format.)

steps:
1. mount nfs server and create test disk:
# mount localhost:/root/test-nfs /mnt -o rw,soft,timeo=1,retrans=0
# qemu-img create test-552487.raw -f raw 200M
Formatting 'test-552487.raw', fmt=raw, size=204800 kB
# qemu-io test-552487.raw
qemu-io> write -P 97 0 50M
wrote 52428800/52428800 bytes at offset 0
50 MiB, 1 ops; 0.0000 sec (69.333 MiB/sec and 1.3867 ops/sec)
qemu-io> write -P 98 50M 50M
wrote 52428800/52428800 bytes at offset 52428800
50 MiB, 1 ops; 0.0000 sec (75.489 MiB/sec and 1.5098 ops/sec)
qemu-io> write -P 99 100M 50M
wrote 52428800/52428800 bytes at offset 104857600
50 MiB, 1 ops; 0.0000 sec (74.699 MiB/sec and 1.4940 ops/sec)
qemu-io> write -P 100 150M 50M
wrote 52428800/52428800 bytes at offset 157286400
50 MiB, 1 ops; 0.0000 sec (74.988 MiB/sec and 1.4998 ops/sec)
qemu-io> quit
# md5sum test-552487.raw
ab5593b62c6e9fb1448e778bdd3c4d00  test-552487.raw

2. start guest:
/usr/libexec/qemu-kvm -no-hpet -usbdevice tablet -rtc-td-hack -smp 2 -m 4G \
    -drive file=RHEL-Server-5.4-64-virtio.qcow2,if=virtio,boot=on \
    -net nic,vlan=0,macaddr=20:88:99:11:20:61,model=e1000 \
    -net tap,vlan=0,script=/etc/qemu-ifup -uuid `uuidgen` -cpu qemu64,+sse2 \
    -vnc :10 -monitor stdio -notify all -M rhel5.5.0 -startdate now \
    -drive file=/mnt/test-552487.raw,cache=off,if=virtio,werror=stop

3. in guest:
dd if=/dev/vdb of=/dev/null

4. in host:
service nfs stop

5. In host dmesg:
nfs: server localhost not responding, timed out


6. In qemu monitor
(qemu) # VM is stopped due to disk write error: ide0-hd0: Input/output error
(qemu) info status
VM status: paused
(qemu) info blockstats
virtio0: rd_bytes=343747072 wr_bytes=39061504 rd_operations=10039 wr_operations=3771
virtio1: rd_bytes=195314176 wr_bytes=0 rd_operations=380498 wr_operations=0

7. in host:
service nfs start

8. In qemu monitor
(qemu) c

9. Tried for 5 times, and then check:
# md5sum test-552487.raw
ab5593b62c6e9fb1448e778bdd3c4d00  test-552487.raw

Comment 28 errata-xmlrpc 2010-03-30 07:53:58 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0271.html

Comment 30 Red Hat Bugzilla 2023-09-14 01:19:16 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

