Description of problem:

"Stopping" RHEL5 thin-provisioned QCOW2 guests on RHEL6 RHEV hosts often results in disk corruption.

Version-Release number of selected component (if applicable):

qemu-kvm-0.12.1.2-2.144.el6.x86_64

How reproducible:

About 60% of the time.

Steps to Reproduce:
1. Create a RHEL 5.6 server template in RHEV-M.
2. Create a RHEL 5.6 pool:
   a) "Pools" tab
   b) "New" button
   c) Name and Description
   d) Number of VMs: 2
   e) Based on my RHEL 5.6 template
   f) 2 GB Memory
   g) 4 Cores
   h) 2 Sockets
   i) Red Hat Enterprise Linux 5.x x64 Operating System
   j) "More" tab
   k) Automatic Pool Type
   l) VNC Protocol Console
   m) Run On Any Host in Cluster
   n) Clone Provisioning in my Guest Storage Domain
   o) OK button
3. Wait for the two guests to be created and in the Down state.
4. Start the guests and wait for them to be fully up.
5. Stop (do not Shutdown) the guests. (A sketch of issuing this stop via the REST API follows this report.)
6. Start the guests and check whether they boot. If a guest boots cleanly, repeat from step 5 until the console shows it stuck at a GRUB prompt (indicating /etc is messed up).

Actual results:

The contents of files on the disk are seemingly randomly garbled. For example, in one round the contents of /etc/passwd were the contents of /etc/resolv.conf. Another time a binary file replaced /etc/grub/grub.conf. It is almost as if the inodes got scrambled.

Expected results:

There may be some issue with the filesystem, but an fsck should fix the glitch on boot-up. Here, things are so messed up that GRUB can't even get to stage 1 to boot at all.

Additional info:

I mentioned this finding to jkt at my 1-on-1 with him this afternoon, and wanted your feedback before proceeding.

I find that about 60% of the time, my thin-provisioned RHEL5 guests on top of my RHEL6 hosts get very, very severe disk corruption when stopped (regardless of whether the stop comes from the WebUI, rhevm_fence, or a straight REST API call). This is Stop, not Shutdown. The identical procedure does NOT corrupt RHEL6 guests on top of the same RHEL6 hosts. Also, pre-allocated RHEL5 and RHEL6 guests never run into this problem.

From my perspective of working on RHCS on top of RHEV, this type of server workload SHOULD be on pre-allocated disks. However, since one CAN put themselves into this situation and the damage is very, very bad, I thought I would ask your opinion on going forward. The options I'm thinking of:

1) Document in the RHCS guides that we require you to pre-allocate RHEL5 guests on top of RHEV, and strongly recommend you pre-allocate RHEL6 guests.
2) Write up a RHEL5 bug to fix whatever is wrong in the guest that makes it get corrupted when its RHEL6 counterpart does not (no idea what component this would even be).
3) Write up a RHEL6 bug to fix whatever is wrong on the host that corrupts RHEL5 (and maybe other) thin-provisioned guests. Again, no idea what component this would be.
4) Write up a RHEV bug, again against an unknown component, and have them try to figure out where the breakage occurs.
5) I can spend more time on this to try to narrow down where the breakage is occurring, but I'm not a filesystem expert. I could at least try the identical thing under plain KVM and see if we can take RHEV out of the equation. I can also try additional types of guests to see whether it is RHEL5 guests only, or just guests that are NOT RHEL6, or some other combination.

Or if you have another idea, that's good too. This touches too many things for me to have a really good answer on where we want to dedicate resources.
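For reference, here is a minimal sketch of issuing the "stop" (destroy, not shutdown) action from step 5 through the RHEV-M REST API. The manager URL, credentials, and VM id are placeholders, and the /api/vms/<id>/stop path and <action/> body are assumed from the oVirt-style REST API, so adjust for the RHEV-M version actually deployed:

# Hedged sketch: stop (destroy) a VM via the RHEV-M REST API.
# The endpoint layout (/api/vms/<id>/stop) and the <action/> body follow the
# oVirt-style REST API; base URL, credentials, and VM id are placeholders.
import base64
import ssl
import urllib.request

RHEVM = "https://rhevm.example.com"             # placeholder manager address
USER, PASSWD = "admin@internal", "secret"       # placeholder credentials
VM_ID = "00000000-0000-0000-0000-000000000000"  # placeholder VM id

def stop_vm(vm_id):
    """POST an empty <action/> to the VM's stop sub-resource."""
    req = urllib.request.Request(
        f"{RHEVM}/api/vms/{vm_id}/stop",
        data=b"<action/>",
        headers={
            "Content-Type": "application/xml",
            "Authorization": "Basic "
            + base64.b64encode(f"{USER}:{PASSWD}".encode()).decode(),
        },
        method="POST",
    )
    # Lab setup with a self-signed certificate; do not skip verification in production.
    ctx = ssl._create_unverified_context()
    with urllib.request.urlopen(req, context=ctx) as resp:
        return resp.status

if __name__ == "__main__":
    print("stop returned HTTP", stop_vm(VM_ID))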
What versions of kvm/qemu-img were used?

gpxe-roms-qemu-0.9.7-6.4.el6.noarch
qemu-img-0.12.1.2-2.144.el6.x86_64
qemu-kvm-0.12.1.2-2.144.el6.x86_64

What guest disk device was used? IDE or virtio?

VirtIO, specifically:

Name: Disk 1
Size: 10 GB
Actual Size: 1 GB
Type: System
Format: COW
Allocation: Thin Provision
Interface: VirtIO
Date Created: 2011-Mar-08, 10:40
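As a cross-check on the thin-provisioned volume itself, something like the following dumps the qcow2 header details with qemu-img (format, virtual size, allocated size). The volume path is a placeholder; on a block/iSCSI storage domain the image is an LV under the storage-domain VG, so the real path will differ:

# Hedged sketch: inspect the guest's qcow2 volume with `qemu-img info`.
# The volume path is a placeholder; on an iSCSI storage domain the image is
# an LV under the storage-domain VG rather than a plain file.
import subprocess

VOLUME = "/dev/<storage-domain-vg>/<image-lv>"  # placeholder path

def image_info(path):
    """Return `qemu-img info` output (format, virtual size, disk size)."""
    return subprocess.run(
        ["qemu-img", "info", path],
        check=True, capture_output=True, text=True,
    ).stdout

if __name__ == "__main__":
    print(image_info(VOLUME))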
[AB] I would go for this option. Open a bz on KVM in RHEL6.

That would be this new bug :)

[AB] Are you using NFS or block devices?

Block, specifically iSCSI.

[AB] what is the vdisk size?

Size: 10 GB
Actual Size: 1 GB

[AB] can you send vdsm logs?

Will do right after this comment.

[AB] how many hosts do you have in your setup?

4

[AB] if the host running the VM is not the SPM in RHEVM, please send also the vdsm log from the spm node

Will do right after this comment.
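A small sketch of how the requested logs might be pulled from the host that ran the VM and from the SPM host. The hostnames are placeholders; /var/log/vdsm/vdsm.log is the usual vdsm log location on RHEL6 hosts:

# Hedged sketch: copy vdsm logs from the VM host and the SPM host.
# Hostnames are placeholders; /var/log/vdsm/vdsm.log is the usual vdsm log
# path on RHEL6 hosts.
import subprocess

HOSTS = ["rhevh-vm-host.example.com", "rhevh-spm.example.com"]  # placeholders

def fetch_vdsm_log(host):
    """scp the vdsm log to a local file named after the host."""
    subprocess.run(
        ["scp", f"root@{host}:/var/log/vdsm/vdsm.log", f"vdsm-{host}.log"],
        check=True,
    )

if __name__ == "__main__":
    for host in HOSTS:
        fetch_vdsm_log(host)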
(In reply to comment #3)
> What versions of kvm/qemu-img were used?
>
> gpxe-roms-qemu-0.9.7-6.4.el6.noarch
> qemu-img-0.12.1.2-2.144.el6.x86_64
> qemu-kvm-0.12.1.2-2.144.el6.x86_64

Please upgrade your qemu-kvm package and retest. qcow2 is broken in -144.
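For the retest, a quick sketch of recording the installed qemu builds so the results can be tied to a specific package NVR (the -144 and -149 builds are the ones quoted in this bug; the package list is an assumption about what is relevant here):

# Hedged sketch: report the installed qemu/vdsm builds via `rpm -q` so the
# retest results can be tied to a specific package NVR (e.g. -144 vs. -149).
import subprocess

PACKAGES = ["qemu-kvm", "qemu-img", "gpxe-roms-qemu", "vdsm"]

def installed_version(pkg):
    """Return the full NVR reported by rpm, or a note if not installed."""
    result = subprocess.run(["rpm", "-q", pkg], capture_output=True, text=True)
    return result.stdout.strip() if result.returncode == 0 else f"{pkg}: not installed"

if __name__ == "__main__":
    for pkg in PACKAGES:
        print(installed_version(pkg))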
(In reply to comment #21)
> (In reply to comment #3)
> > What versions of kvm/qemu-img were used?
> >
> > gpxe-roms-qemu-0.9.7-6.4.el6.noarch
> > qemu-img-0.12.1.2-2.144.el6.x86_64
> > qemu-kvm-0.12.1.2-2.144.el6.x86_64
>
> Please upgrade your qemu-kvm package and retest. qcow2 is broken in -144.

After trying a sniff test of upgrading to vdsm-4.9-52.el6.x86_64 and getting a failure, I decided to upgrade all of the other packages as well, but roll back to the known-working vdsm-4.9-47.el6.x86_64. The behavior was exactly the same with this rolled-back vdsm but newer everything else.

So, the testing described in Comment 22 was actually done with:

gpxe-roms-qemu-0.9.7-6.7.el6.noarch
qemu-img-0.12.1.2-2.149.el6.x86_64
qemu-kvm-0.12.1.2-2.149.el6.x86_64

with identical results to the original description.
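Since the corruption reproduces with the newer builds as well, it may be worth distinguishing qcow2-level metadata corruption from guest-filesystem corruption. A hedged sketch using `qemu-img check` against the (placeholder) volume path, assuming the VM is down and the volume is accessible on the host:

# Hedged sketch: run `qemu-img check` against the guest's qcow2 volume to see
# whether the image metadata itself is inconsistent, or only the guest
# filesystem. The volume path is a placeholder; run with the VM powered off.
import subprocess

VOLUME = "/dev/<storage-domain-vg>/<image-lv>"  # placeholder path

def check_image(path):
    """Return qemu-img check's exit code and combined output."""
    result = subprocess.run(
        ["qemu-img", "check", path],
        capture_output=True, text=True,
    )
    return result.returncode, result.stdout + result.stderr

if __name__ == "__main__":
    code, output = check_image(VOLUME)
    print(output)
    print("qemu-img check exit code:", code)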