Description of problem:
Customer was given a temporary custom build of the image corruption fix, which seemed to work. Then the update mentioned in $subj came out and the customer updated, since the update was supposed to contain the fixes he had been using temporarily. After the update, the image corruption issue returned.

Version-Release number of selected component (if applicable):
5.4-2.1.3.el5_4rhev2_1

[root@bwyhs0018p vdsm]# rpm -qa | grep kvm
kmod-kvm-83-105.el5_4.13
etherboot-zroms-kvm-5.4.4-10.el5
kvm-debuginfo-83-105.el5_4.13
kvm-qemu-img-83-105.el5_4.13
kvm-83-105.el5_4.13
kvm-tools-83-105.el5_4.13
[root@bwyhs0018p vdsm]# rpm -qa | grep vdsm
vdsm-reg-4.4-37677
vdsm-cli-4.4-37677
vdsm-4.4-37677

How reproducible:
Intermittent.

Steps to Reproduce:
1. Run several RHEL5 VMs using NFS storage.
2. Load the VMs with heavy disk and CPU load (the customer compiles gcc in a loop).
3. Turn off the NFS server, then turn it back on.

Actual results:
In the guest:
Jan 4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656639
Jan 4 13:13:15 gatest-c kernel: Aborting journal on device dm-0.
Jan 4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656640
Jan 4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656641
Jan 4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656642
Jan 4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656643
Jan 4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656644
Jan 4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656645
Jan 4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656646
Jan 4 13:13:15 gatest-c kernel: ext3_abort called.
Jan 4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
Jan 4 13:13:15 gatest-c kernel: Remounting filesystem read-only
Jan 4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656647
Jan 4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656648
Jan 4 13:13:15 gatest-c kernel: EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 656649

Expected results:
No corruption; the guest should pause and continue when the storage comes back online.

Additional info:
It should have been fixed by Bug 540406 - RHEL5.4 VM image corruption with an IDE v-disk
A new bug, 550949, was also opened for a similar issue.
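The reproduction steps above can be sketched as a shell sequence. This is only a hedged illustration: the NFS export path, mount point, and gcc source path are made up, not taken from the report; `service nfs stop`/`start` and the soft-mount options are the ones used elsewhere in this bug.

```shell
# Hypothetical sketch of the reproduction; all paths are placeholders.
# 1. On the host, mount the NFS storage domain the VM images live on
#    (soft mount so I/O errors surface instead of hanging forever):
mount nfs-server:/export/images /rhev/nfs -o rw,soft

# 2. Inside each RHEL5 guest, generate heavy disk and CPU load,
#    e.g. compile gcc in a loop as the customer did:
while true; do
    make -C /usr/src/gcc clean all
done

# 3. On the NFS server, cut the storage and bring it back:
service nfs stop
sleep 60
service nfs start
```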
Is this even IDE or is it virtio? As far as I can tell, the patches that worked in the scratch build are contained in kvm-83-105.el5_4.13. At least they are mentioned in the changelog.
(In reply to comment #3)
> Is this even IDE or is it virtio?
>
> As far as I can tell, the patches that worked in the scratch build are
> contained in kvm-83-105.el5_4.13. At least they are mentioned in the changelog.

The recent failures using kvm-83-105.el5_4.13 were all virtio based.
I can't reproduce this with the binary in kvm-83-105.el5_4.13. With werror=stop the VM is paused when I stop the NFS server, and without it, Linux sees the I/O errors and remounts the file system read-only without getting corruption.

This whole report looks like the customer was running an unpatched version again. Are you 100% sure that the rpm -qa output is from the right machine and that this version is really the one running?

For reference, I tested with the binary from https://brewweb.devel.redhat.com/buildinfo?buildID=119205. The md5sum of qemu-kvm is bf464978c52cb4e99e8795644c940017.
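For context, werror= is a per-drive option, so whether the VM pauses on storage errors depends on the command line the VM was started with. A minimal invocation along these lines (memory size and image path are hypothetical; the drive options match the ones used in this bug) makes qemu pause the guest on a write error instead of completing the request:

```shell
# Hypothetical minimal invocation; only the -drive options matter here.
/usr/libexec/qemu-kvm -m 1G \
    -drive file=/mnt/nfs/guest.raw,if=virtio,cache=off,werror=stop
# With werror=stop, an I/O error from the backing storage pauses the VM
# ("info status" in the monitor shows "paused"); "c" resumes it once
# the storage is back.
```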
(In reply to comment #7)
> I can't reproduce this with the binary in the kvm-83-105.el5_4.13. With
> werror=stop the VM is paused when I stop the NFS server and without it, Linux
> sees the I/O errors and remounts the file system read-only without getting
> corruption.
>
> This whole report looks like the customer was running an unpatched version
> again. Are you 100% sure that the rpm -qa output is from the right machine and
> this version is really running?
>
> For reference, I tested with the binary from
> https://brewweb.devel.redhat.com/buildinfo?buildID=119205. The md5sum of
> qemu-kvm is bf464978c52cb4e99e8795644c940017.

We've seen two more corruption cases today:

"""
All the nodes have
kmod-kvm-83-105.el5_4.13
etherboot-zroms-kvm-5.4.4-10.el5
kvm-debuginfo-83-105.el5_4.13
kvm-qemu-img-83-105.el5_4.13
kvm-83-105.el5_4.13
kvm-tools-83-105.el5_4.13

# md5sum /usr/libexec/qemu-kvm
bf464978c52cb4e99e8795644c940017  /usr/libexec/qemu-kvm
"""

What can we do next to further narrow this down? If you would like access to the image, or any other data, I can gather it fairly quickly.

--chris
Additionally, of the three new instances of the corruption we've seen, two occurred on a sparse RAW image and one on a sparse COW image.
Christoph, any idea? To me this looks very much like virtio-blk reading garbage instead of getting an error: something with the same effect as what was fixed for bug 531827, except that we haven't reproduced this one locally so far.
This very much looks like I/O errors reaching the filesystems. First, are you really sure you are running the updated version and haven't just installed it, that is, were all qemu instances restarted after the upgrade? If that's ruled out, I'd really love to see some more traces from inside qemu. Could you run an instrumented qemu binary so we can figure out more about this?
Yes, the VMs were all restarted. In fact, the upgrade of the RHEV-H node reboots the node as well, so we are definitely running the correct versions. We have also checked the md5sums of the binaries, and they match.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
Tested virtio block in kvm-83-160.el5 with both raw and qcow2; the guest stops on read error. (Tried 5 times for each format.)

Steps:
1. Mount the NFS server and create a test disk:
# mount localhost:/root/test-nfs /mnt -o rw,soft,timeo=1,retrans=0
# qemu-img create test-552487.raw -f raw 200M
Formatting 'test-552487.raw', fmt=raw, size=204800 kB
# qemu-io test-552487.raw
qemu-io> write -P 97 0 50M
wrote 52428800/52428800 bytes at offset 0
50 MiB, 1 ops; 0.0000 sec (69.333 MiB/sec and 1.3867 ops/sec)
qemu-io> write -P 98 50M 50M
wrote 52428800/52428800 bytes at offset 52428800
50 MiB, 1 ops; 0.0000 sec (75.489 MiB/sec and 1.5098 ops/sec)
qemu-io> write -P 99 100M 50M
wrote 52428800/52428800 bytes at offset 104857600
50 MiB, 1 ops; 0.0000 sec (74.699 MiB/sec and 1.4940 ops/sec)
qemu-io> write -P 100 150M 50M
wrote 52428800/52428800 bytes at offset 157286400
50 MiB, 1 ops; 0.0000 sec (74.988 MiB/sec and 1.4998 ops/sec)
qemu-io> quit
# md5sum test-552487.raw
ab5593b62c6e9fb1448e778bdd3c4d00  test-552487.raw

2. Start the guest:
/usr/libexec/qemu-kvm -no-hpet -usbdevice tablet -rtc-td-hack -smp 2 -m 4G -drive file=RHEL-Server-5.4-64-virtio.qcow2,if=virtio,boot=on -net nic,vlan=0,macaddr=20:88:99:11:20:61,model=e1000 -net tap,vlan=0,script=/etc/qemu-ifup -uuid `uuidgen` -cpu qemu64,+sse2 -vnc :10 -monitor stdio -notify all -M rhel5.5.0 -startdate now -drive file=/mnt/test-552487.raw,cache=off,if=virtio,werror=stop

3. In the guest:
dd if=/dev/vdb of=/dev/null

4. In the host:
service nfs stop

5. In host dmesg:
nfs: server localhost not responding, timed out

6. In the qemu monitor:
(qemu) # VM is stopped due to disk write error: ide0-hd0: Input/output error
(qemu) info status
VM status: paused
(qemu) info blockstats
virtio0: rd_bytes=343747072 wr_bytes=39061504 rd_operations=10039 wr_operations=3771
virtio1: rd_bytes=195314176 wr_bytes=0 rd_operations=380498 wr_operations=0

7. In the host:
service nfs start

8. In the qemu monitor:
(qemu) c

9. Repeat 5 times, then check:
# md5sum test-552487.raw
ab5593b62c6e9fb1448e778bdd3c4d00  test-552487.raw
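The verification above boils down to pattern-filling the image and comparing checksums before and after the outage. A scaled-down stand-in for that check, using 4 KiB regions instead of 50 MiB and a plain temp file instead of a qemu disk (so no qemu or NFS involved), just to illustrate the technique:

```shell
# Scaled-down stand-in for the qemu-io pattern test above. This only
# illustrates the checksum comparison; it does not involve qemu or NFS.
img=$(mktemp)

# Fill four regions with distinct byte patterns (qemu-io's -P 97..100,
# i.e. the characters 'a'..'d').
for i in 0 1 2 3; do
    pat=$(printf "\\$(printf '%03o' $((97 + i)))")
    printf '%4096s' | tr ' ' "$pat" >> "$img"
done

sum_before=$(md5sum "$img" | cut -d' ' -f1)
# ... the storage outage and recovery would happen here ...
sum_after=$(md5sum "$img" | cut -d' ' -f1)

if [ "$sum_before" = "$sum_after" ]; then
    echo "image intact"      # prints "image intact" when checksums match
else
    echo "CORRUPTION: $sum_before != $sum_after"
fi
```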
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0271.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days