Description of problem: Setup details: RHEV-H hosts, running sp211, kmod-kvm-83-105.el5_4.8 kvm-tools-83-105.el5_4.8 etherboot-zroms-kvm-5.4.4-10.el5 kvm-debuginfo-83-105.el5_4.8 kvm-qemu-img-83-105.el5_4.8 kvm-83-105.el5_4.8 Using Mellanox 10G NICs to connect to NFS storage, based on a Netapp filer. Two RHEL5.4 templates, with a few VMs created from the template Template 1: Using VirtIO v-disk - VMs get corrupt when system loses connectivity to the storage Template 2: Using IDE v-disk - VMs pause when system loses storage connectivity How reproducible: Randomly, but often enough Steps to Reproduce: 1. 2. 3. Actual results: The root O/S filesystem for a RHEV guest running RHEL-5.4 became corrupted and marked as readonly, whilst running a sosreport. The following errors are in the messages file: Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951745 in dir #951720 Sep 30 05:36:43 CHP10go kernel: Aborting journal on device vda2. Sep 30 05:36:43 CHP10go kernel: __journal_remove_journal_head: freeing b_committed_data Sep 30 05:36:43 CHP10go kernel: __journal_remove_journal_head: freeing b_committed_data Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951746 in dir #951720 Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951747 in dir #951720 Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951748 in dir #951720 Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951749 in dir #951720 Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951750 in dir #951720 Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951751 in dir #951720 Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951752 in dir #951720 Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951753 in dir #951720 Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951754 in dir #951720 Sep 30 05:36:43 CHP10go kernel: ext3_abort called. Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_journal_start_sb: Detected aborted journal Sep 30 05:36:43 CHP10go kernel: Remounting filesystem read-only Sep 30 05:36:54 CHP10go kernel: attempt to access beyond end of device Sep 30 05:36:54 CHP10go kernel: vda2: rw=0, want=34359738368, limit=8385930 Sep 30 05:36:54 CHP10go kernel: attempt to access beyond end of device Sep 30 05:36:54 CHP10go kernel: vda2: rw=0, want=34359738368, limit=8385930 Sep 30 05:36:54 CHP10go kernel: printk: 6 messages suppressed. Sep 30 05:36:54 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 983731 in dir #983357 Sep 30 05:36:54 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 983732 in dir #983357 Expected results: VMs to pause, and resume when storage is lost Additional info: In case
Is that a RHEV guest or a RHEL guest? It's the first time I heard RHEV is a guest. If it's true, can you ask them to replay it with standard RHEL guest. \
(In reply to comment #1) > Is that a RHEV guest or a RHEL guest? > It's the first time I heard RHEV is a guest. > If it's true, can you ask them to replay it with standard RHEL guest. Dor -- This is a standard RHEL 5.4 installation underneath a RHEV Hypervisor managed by RHEV M. Is there any specific info you would like to see from one of the guests? --chris
Dor, if you require additional information, you can contact Chris (he's the TAM for this customer), or me
The customer reproduced the issue on a 1Gb NIC, so this is not a 10G issue The latest corruption: Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2): ext3_free_blocks_sb: bit already cleared for block 765824 Oct 30 12:56:29 222c kernel: Aborting journal on device vda2. Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2) in ext3_reserve_inode_write: Journal has aborted Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2) in ext3_truncate: Journal has aborted Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2) in ext3_reserve_inode_write: Journal has aborted Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2) in ext3_orphan_del: Journal has aborted Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2) in ext3_reserve_inode_write: Journal has aborted Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2) in ext3_delete_inode: Journal has aborted Oct 30 12:56:29 222c kernel: __journal_remove_journal_head: freeing b_committed_data Oct 30 12:56:29 222c kernel: __journal_remove_journal_head: freeing b_committed_data Oct 30 12:56:29 222c kernel: ext3_abort called. Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2): ext3_journal_start_sb: Detected aborted journal Oct 30 12:56:29 222c kernel: Remounting filesystem read-only
(In reply to comment #4) > The customer reproduced the issue on a 1Gb NIC, so this is not a 10G issue When this failure occurred they were running a stress test for over a day. The test is a simple recompile of upstream gcc in a loop. It was well into day two when they saw the corruption, and the fs was mounted read only. Please let me know if you need any other data here. --chris
I can see the VMs being stopped in vdsm.log: vdsm.log.1.gz:QMon-756509::DEBUG::2009-10-30 12:01:24,121::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: Input/output error vdsm.log.1.gz:QMon-765619::DEBUG::2009-10-30 12:01:24,697::QemuMonitor::229::vds.vmlog.1b555a43-b789-4f0b-9ac9-b0c19d7872c1::# VM is stopped due to disk write error: ide0-hd0: Input/output error vdsm.log.1.gz:QMon-769615::DEBUG::2009-10-30 12:01:26,029::QemuMonitor::229::vds.vmlog.be77e39a-2ce3-427a-b43b-0f63bba0a176::# VM is stopped due to disk write error: virtio0: Input/output error vdsm.log.1.gz:QMon-769615::DEBUG::2009-10-30 12:01:26,029::QemuMonitor::229::vds.vmlog.be77e39a-2ce3-427a-b43b-0f63bba0a176::# VM is stopped due to disk write error: virtio0: Input/output error vdsm.log.1.gz:QMon-769615::DEBUG::2009-10-30 12:01:26,029::QemuMonitor::229::vds.vmlog.be77e39a-2ce3-427a-b43b-0f63bba0a176::# VM is stopped due to disk write error: virtio0: Input/output error vdsm.log.2.gz:QMon-739523::DEBUG::2009-10-30 05:03:37,181::QemuMonitor::229::vds.vmlog.32d39ed6-5a7d-4c6f-9fcc-e8f22e857172::# VM is stopped due to disk write error: virtio0: No space left on device vdsm.log.2.gz:QMon-731858::DEBUG::2009-10-30 05:03:44,418::QemuMonitor::229::vds.vmlog.45937870-c797-4882-af3a-c043b11deadf::# VM is stopped due to disk write error: virtio0: No space left on device vdsm.log.2.gz:QMon-731858::DEBUG::2009-10-30 05:03:44,419::QemuMonitor::229::vds.vmlog.45937870-c797-4882-af3a-c043b11deadf::# VM is stopped due to disk write error: virtio0: No space left on device vdsm.log.2.gz:QMon-731858::DEBUG::2009-10-30 05:03:44,419::QemuMonitor::229::vds.vmlog.45937870-c797-4882-af3a-c043b11deadf::# VM is stopped due to disk write error: virtio0: No space left on device vdsm.log.2.gz:QMon-731858::DEBUG::2009-10-30 05:03:44,420::QemuMonitor::229::vds.vmlog.45937870-c797-4882-af3a-c043b11deadf::# VM is stopped due to disk write error: virtio0: No space left on device vdsm.log.2.gz:QMon-727238::DEBUG::2009-10-30 05:03:44,729::QemuMonitor::229::vds.vmlog.1b555a43-b789-4f0b-9ac9-b0c19d7872c1::# VM is stopped due to disk write error: ide0-hd0: No space left on device vdsm.log.2.gz:QMon-747228::DEBUG::2009-10-30 05:03:50,128::QemuMonitor::229::vds.vmlog.2801d18d-3716-4e27-8f70-9efeabb9c7e3::# VM is stopped due to disk write error: virtio0: No space left on device vdsm.log.2.gz:QMon-747228::DEBUG::2009-10-30 05:03:50,129::QemuMonitor::229::vds.vmlog.2801d18d-3716-4e27-8f70-9efeabb9c7e3::# VM is stopped due to disk write error: virtio0: No space left on device vdsm.log.2.gz:QMon-747228::DEBUG::2009-10-30 05:03:50,130::QemuMonitor::229::vds.vmlog.2801d18d-3716-4e27-8f70-9efeabb9c7e3::# VM is stopped due to disk write error: virtio0: No space left on device vdsm.log.2.gz:QMon-747228::DEBUG::2009-10-30 05:03:50,131::QemuMonitor::229::vds.vmlog.2801d18d-3716-4e27-8f70-9efeabb9c7e3::# VM is stopped due to disk write error: virtio0: No space left on device vdsm.log.2.gz:QMon-751844::DEBUG::2009-10-30 05:19:42,662::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device vdsm.log.2.gz:QMon-752208::DEBUG::2009-10-30 05:26:36,985::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device vdsm.log.2.gz:QMon-752525::DEBUG::2009-10-30 05:32:37,301::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device vdsm.log.2.gz:QMon-752932::DEBUG::2009-10-30 05:40:37,553::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device vdsm.log.2.gz:QMon-753293::DEBUG::2009-10-30 05:47:37,809::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device vdsm.log.2.gz:QMon-753658::DEBUG::2009-10-30 05:54:37,918::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device vdsm.log.2.gz:QMon-754068::DEBUG::2009-10-30 06:02:38,358::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device vdsm.log.2.gz:QMon-754473::DEBUG::2009-10-30 06:10:38,591::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device vdsm.log.2.gz:QMon-754790::DEBUG::2009-10-30 06:16:38,842::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device vdsm.log.2.gz:QMon-755108::DEBUG::2009-10-30 06:22:39,140::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device vdsm.log.2.gz:QMon-755468::DEBUG::2009-10-30 06:29:39,289::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device vdsm.log.2.gz:QMon-755784::DEBUG::2009-10-30 06:35:40,560::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device vdsm.log.2.gz:QMon-756149::DEBUG::2009-10-30 06:42:39,728::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device vdsm.log.2.gz:QMon-756509::DEBUG::2009-10-30 06:49:39,736::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device vdsm.log.5.gz:QMon-569::DEBUG::2009-10-29 07:51:53,615::QemuMonitor::229::vds.vmlog.9f382326-7423-4941-a7c8-368dad53fb44::# VM is stopped due to disk write error: virtio0: No space left on device vdsm.log.5.gz:QMon-611::DEBUG::2009-10-29 07:52:02,340::QemuMonitor::229::vds.vmlog.45937870-c797-4882-af3a-c043b11deadf::# VM is stopped due to disk write error: virtio0: No space left on device
Apararntly, seems like Uli's analysis is correct and we get timeout errors from NFS. Since currently we only translate them into no_space, this what happens for IDE. For virtio, from the above code, seems like write returns error while read is ignored. It's not good either. My recommendation is to handle all errors as ENOSPAC for this release and let mgmt resume the guest (although it does not relate to lack of storage). We need to make sure that read errors are also handled.
This bug can be reproduced in kvm-105.4.10. step: 1. create a local nfs server in host, then mount localhost:/root/test-nfs /mnt -o rw,soft,timeo=1,retrans=0 cd /mnt qemu-img create test-virtio.qcow2 -f qcow2 10G 2. start guest: /usr/libexec/qemu-kvm -rtc-td-hack -no-hpet -usbdevice tablet -drive file=rhel5u4-64-virtio.qcow2,if=virtio,boot=on -cpu qemu64,+sse2 -smp 2 -m 2G -net nic,macaddr=20:20:20:33:34:66,model=virtio,vlan=1 -net tap,script=/etc/qemu-ifup,vlan=1 -drive file=/mnt/test-virtio.qcow2,format=qcow2,cache=off,werror=stop,if=virtio -vnc :10 -name rhel54_64 -monitor stdio 3. in guest: fdisk /dev/hdb mkfs.ext3 /dev/hdb1 mount /dev/hdb1 /mnt cp -r /usr/share /mnt 4. then in host: gdb -p 31766 (gdb) break virtio_blk_rw_complete (gdb) continue 5. in host: service nfs stop 6. in host: service nfs start 7. (gdb) quit Actual result: guest dmesg: end_request: I/O error, dev vdb, sector 5292199 Buffer I/O error on device vdb1, logical block 661517 lost page write due to I/O error on vdb1 Buffer I/O error on device vdb1, logical block 661518 lost page write due to I/O error on vdb1 Buffer I/O error on device vdb1, logical block 661519 lost page write due to I/O error on vdb1 Buffer I/O error on device vdb1, logical block 661520 lost page write due to I/O error on vdb1 Buffer I/O error on device vdb1, logical block 661521 lost page write due to I/O error on vdb1 Buffer I/O error on device vdb1, logical block 661522 lost page write due to I/O error on vdb1 Buffer I/O error on device vdb1, logical block 661523 lost page write due to I/O error on vdb1 Buffer I/O error on device vdb1, logical block 661524 lost page write due to I/O error on vdb1 Buffer I/O error on device vdb1, logical block 661525 lost page write due to I/O error on vdb1 Buffer I/O error on device vdb1, logical block 661526 lost page write due to I/O error on vdb1 Aborting journal on device vdb1. ext3_abort called. EXT3-fs error (device vdb1): ext3_journal_start_sb: Detected aborted journal Remounting filesystem read-only ext3_abort called. EXT3-fs error (device vdb1): ext3_journal_start_sb: Detected aborted journal Remounting filesystem read-only
Test with build kvm-83-136 again, VM can stop on read errors. steps: 1. create a local nfs server in host, then mount localhost:/root/test-nfs /mnt -o rw,soft,timeo=1,retrans=0 cd /mnt qemu-img create test-533390.qcow2 -f qcow2 10G 2. start guest: /usr/libexec/qemu-kvm -rtc-td-hack -no-hpet -usbdevice tablet -cpu qemu64,+sse2 -drive file=RHEL-Server-5.4-64-virtio.qcow2,if=virtio,boot=on,format=qcow2,cache=off,werror=stop -smp 2 -m 2G -vnc :1 -net nic,macaddr=20:20:20:11:12:56,model=virtio,vlan=0 -net tap,script=/etc/qemu-ifup,vlan=0 -monitor stdio -drive file=/mnt/test-533390.qcow2,if=virtio,format=qcow2,cache=off,werror=stop 3. in guest: fdisk /dev/vdb mkfs.ext3 /dev/vdb1 dd if=/dev/vdb1 of=/dev/null 4. in host: service nfs stop 5. In host dmesg: nfs: server localhost not responding, timed out 6. In qemu monitor (qemu)info status VM status: paused 7. in host: service nfs start 8. In qemu monitor (qemu)c Guest works fine after restore.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0271.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days