Bug 531827 - O/S Filesystem Corruption with RHEL-5.4 on a RHEV Guest
Summary: O/S Filesystem Corruption with RHEL-5.4 on a RHEV Guest
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.4
Hardware: All
OS: Linux
urgent
high
Target Milestone: rc
: ---
Assignee: Kevin Wolf
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 533178 533390 537334 537637
TreeView+ depends on / blocked
 
Reported: 2009-10-29 15:37 UTC by Dan Yasny
Modified: 2023-09-14 01:18 UTC (History)
14 users (show)

Fixed In Version: kvm-83-136.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 533178 537637 (view as bug list)
Environment:
Last Closed: 2010-03-30 07:52:23 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0271 0 normal SHIPPED_LIVE Important: kvm security, bug fix and enhancement update 2010-03-29 13:19:48 UTC

Description Dan Yasny 2009-10-29 15:37:13 UTC
Description of problem:
Setup details:
RHEV-H hosts, running sp211, 
kmod-kvm-83-105.el5_4.8
kvm-tools-83-105.el5_4.8
etherboot-zroms-kvm-5.4.4-10.el5
kvm-debuginfo-83-105.el5_4.8
kvm-qemu-img-83-105.el5_4.8
kvm-83-105.el5_4.8

Using Mellanox 10G NICs to connect to NFS storage, based on a Netapp filer.

Two RHEL5.4 templates, with a few VMs created from the template
Template 1: Using VirtIO v-disk - VMs get corrupt when system loses connectivity to the storage

Template 2: Using IDE v-disk - VMs pause when system loses storage connectivity




How reproducible:
Randomly, but often enough


Steps to Reproduce:
1.
2.
3.
  
Actual results:
The root O/S filesystem for a RHEV guest running RHEL-5.4 became corrupted and marked as readonly, whilst running a sosreport. The following errors are in the messages file:

Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951745 in dir #951720
Sep 30 05:36:43 CHP10go kernel: Aborting journal on device vda2.
Sep 30 05:36:43 CHP10go kernel: __journal_remove_journal_head: freeing b_committed_data
Sep 30 05:36:43 CHP10go kernel: __journal_remove_journal_head: freeing b_committed_data
Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951746 in dir #951720
Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951747 in dir #951720
Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951748 in dir #951720
Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951749 in dir #951720
Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951750 in dir #951720
Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951751 in dir #951720
Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951752 in dir #951720
Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951753 in dir #951720
Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 951754 in dir #951720
Sep 30 05:36:43 CHP10go kernel: ext3_abort called.
Sep 30 05:36:43 CHP10go kernel: EXT3-fs error (device vda2): ext3_journal_start_sb: Detected aborted journal
Sep 30 05:36:43 CHP10go kernel: Remounting filesystem read-only
Sep 30 05:36:54 CHP10go kernel: attempt to access beyond end of device
Sep 30 05:36:54 CHP10go kernel: vda2: rw=0, want=34359738368, limit=8385930
Sep 30 05:36:54 CHP10go kernel: attempt to access beyond end of device
Sep 30 05:36:54 CHP10go kernel: vda2: rw=0, want=34359738368, limit=8385930
Sep 30 05:36:54 CHP10go kernel: printk: 6 messages suppressed.
Sep 30 05:36:54 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 983731 in dir #983357
Sep 30 05:36:54 CHP10go kernel: EXT3-fs error (device vda2): ext3_lookup: unlinked inode 983732 in dir #983357

Expected results:
VMs to pause, and resume when storage is lost

Additional info:

In case

Comment 1 Dor Laor 2009-10-29 20:18:26 UTC
Is that a RHEV guest or a RHEL guest?
It's the first time I heard RHEV is a guest.
If it's true, can you ask them to replay it with standard RHEL guest.

\

Comment 2 Chris Van Hoof 2009-10-29 20:49:15 UTC
(In reply to comment #1)
> Is that a RHEV guest or a RHEL guest?
> It's the first time I heard RHEV is a guest.
> If it's true, can you ask them to replay it with standard RHEL guest.

Dor -- This is a standard RHEL 5.4 installation underneath a RHEV Hypervisor managed by RHEV M.  Is there any specific info you would like to see from one of the guests?

--chris

Comment 3 Dan Yasny 2009-10-30 09:07:01 UTC
Dor, if you require additional information, you can contact Chris (he's the TAM for this customer), or me

Comment 4 Dan Yasny 2009-11-02 09:17:41 UTC
The customer reproduced the issue on a 1Gb NIC, so this is not a 10G issue

The latest corruption:

Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2):
ext3_free_blocks_sb: bit already cleared for block 765824
Oct 30 12:56:29 222c kernel: Aborting journal on device vda2.
Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2) in
ext3_reserve_inode_write: Journal has aborted
Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2) in
ext3_truncate: Journal has aborted
Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2) in
ext3_reserve_inode_write: Journal has aborted
Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2) in
ext3_orphan_del: Journal has aborted
Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2) in
ext3_reserve_inode_write: Journal has aborted
Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2) in
ext3_delete_inode: Journal has aborted
Oct 30 12:56:29 222c kernel: __journal_remove_journal_head: freeing
b_committed_data
Oct 30 12:56:29 222c kernel: __journal_remove_journal_head: freeing
b_committed_data
Oct 30 12:56:29 222c kernel: ext3_abort called.
Oct 30 12:56:29 222c kernel: EXT3-fs error (device vda2):
ext3_journal_start_sb: Detected aborted journal
Oct 30 12:56:29 222c kernel: Remounting filesystem read-only

Comment 6 Chris Van Hoof 2009-11-03 22:11:28 UTC
(In reply to comment #4)
> The customer reproduced the issue on a 1Gb NIC, so this is not a 10G issue

When this failure occurred they were running a stress test for over a day.  The test is a simple recompile of upstream gcc in a loop.  It was well into day two when they saw the corruption, and the fs was mounted read only.

Please let me know if you need any other data here.

--chris

Comment 7 Dan Yasny 2009-11-04 10:52:03 UTC
I can see the VMs being stopped in vdsm.log:
                                                                                                     
vdsm.log.1.gz:QMon-756509::DEBUG::2009-10-30 12:01:24,121::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: Input/output error                                                                                                                                                                          
vdsm.log.1.gz:QMon-765619::DEBUG::2009-10-30 12:01:24,697::QemuMonitor::229::vds.vmlog.1b555a43-b789-4f0b-9ac9-b0c19d7872c1::# VM is stopped due to disk write error: ide0-hd0: Input/output error                                                                                                                                                                          
vdsm.log.1.gz:QMon-769615::DEBUG::2009-10-30 12:01:26,029::QemuMonitor::229::vds.vmlog.be77e39a-2ce3-427a-b43b-0f63bba0a176::# VM is stopped due to disk write error: virtio0: Input/output error                                                                                                                                                                           
vdsm.log.1.gz:QMon-769615::DEBUG::2009-10-30 12:01:26,029::QemuMonitor::229::vds.vmlog.be77e39a-2ce3-427a-b43b-0f63bba0a176::# VM is stopped due to disk write error: virtio0: Input/output error                                                                                                                                                                           
vdsm.log.1.gz:QMon-769615::DEBUG::2009-10-30 12:01:26,029::QemuMonitor::229::vds.vmlog.be77e39a-2ce3-427a-b43b-0f63bba0a176::# VM is stopped due to disk write error: virtio0: Input/output error                                                                                                                                                                           
vdsm.log.2.gz:QMon-739523::DEBUG::2009-10-30 05:03:37,181::QemuMonitor::229::vds.vmlog.32d39ed6-5a7d-4c6f-9fcc-e8f22e857172::# VM is stopped due to disk write error: virtio0: No space left on device                                                                                                                                                                     
vdsm.log.2.gz:QMon-731858::DEBUG::2009-10-30 05:03:44,418::QemuMonitor::229::vds.vmlog.45937870-c797-4882-af3a-c043b11deadf::# VM is stopped due to disk write error: virtio0: No space left on device                                                                                                                                                                     
vdsm.log.2.gz:QMon-731858::DEBUG::2009-10-30 05:03:44,419::QemuMonitor::229::vds.vmlog.45937870-c797-4882-af3a-c043b11deadf::# VM is stopped due to disk write error: virtio0: No space left on device                                                                                                                                                                     
vdsm.log.2.gz:QMon-731858::DEBUG::2009-10-30 05:03:44,419::QemuMonitor::229::vds.vmlog.45937870-c797-4882-af3a-c043b11deadf::# VM is stopped due to disk write error: virtio0: No space left on device                                                                                                                                                                     
vdsm.log.2.gz:QMon-731858::DEBUG::2009-10-30 05:03:44,420::QemuMonitor::229::vds.vmlog.45937870-c797-4882-af3a-c043b11deadf::# VM is stopped due to disk write error: virtio0: No space left on device                                                                                                                                                                     
vdsm.log.2.gz:QMon-727238::DEBUG::2009-10-30 05:03:44,729::QemuMonitor::229::vds.vmlog.1b555a43-b789-4f0b-9ac9-b0c19d7872c1::# VM is stopped due to disk write error: ide0-hd0: No space left on device                                                                                                                                                                     
vdsm.log.2.gz:QMon-747228::DEBUG::2009-10-30 05:03:50,128::QemuMonitor::229::vds.vmlog.2801d18d-3716-4e27-8f70-9efeabb9c7e3::# VM is stopped due to disk write error: virtio0: No space left on device                                                                                                                                                                     
vdsm.log.2.gz:QMon-747228::DEBUG::2009-10-30 05:03:50,129::QemuMonitor::229::vds.vmlog.2801d18d-3716-4e27-8f70-9efeabb9c7e3::# VM is stopped due to disk write error: virtio0: No space left on device                                                                                                                                                                     
vdsm.log.2.gz:QMon-747228::DEBUG::2009-10-30 05:03:50,130::QemuMonitor::229::vds.vmlog.2801d18d-3716-4e27-8f70-9efeabb9c7e3::# VM is stopped due to disk write error: virtio0: No space left on device                                                                                                                                                                     
vdsm.log.2.gz:QMon-747228::DEBUG::2009-10-30 05:03:50,131::QemuMonitor::229::vds.vmlog.2801d18d-3716-4e27-8f70-9efeabb9c7e3::# VM is stopped due to disk write error: virtio0: No space left on device                                                                                                                                                                     
vdsm.log.2.gz:QMon-751844::DEBUG::2009-10-30 05:19:42,662::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device                                                                                                                                                                     
vdsm.log.2.gz:QMon-752208::DEBUG::2009-10-30 05:26:36,985::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device                                                                                                                                                                     
vdsm.log.2.gz:QMon-752525::DEBUG::2009-10-30 05:32:37,301::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device                                                                                                                                                                     
vdsm.log.2.gz:QMon-752932::DEBUG::2009-10-30 05:40:37,553::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device                                                                                                                                                                     
vdsm.log.2.gz:QMon-753293::DEBUG::2009-10-30 05:47:37,809::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device                                                                                                                                                                     
vdsm.log.2.gz:QMon-753658::DEBUG::2009-10-30 05:54:37,918::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device
vdsm.log.2.gz:QMon-754068::DEBUG::2009-10-30 06:02:38,358::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device
vdsm.log.2.gz:QMon-754473::DEBUG::2009-10-30 06:10:38,591::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device
vdsm.log.2.gz:QMon-754790::DEBUG::2009-10-30 06:16:38,842::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device
vdsm.log.2.gz:QMon-755108::DEBUG::2009-10-30 06:22:39,140::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device
vdsm.log.2.gz:QMon-755468::DEBUG::2009-10-30 06:29:39,289::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device
vdsm.log.2.gz:QMon-755784::DEBUG::2009-10-30 06:35:40,560::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device
vdsm.log.2.gz:QMon-756149::DEBUG::2009-10-30 06:42:39,728::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device
vdsm.log.2.gz:QMon-756509::DEBUG::2009-10-30 06:49:39,736::QemuMonitor::229::vds.vmlog.b1096399-1f74-40a8-898e-9a0520e07b22::# VM is stopped due to disk write error: ide0-hd0: No space left on device
vdsm.log.5.gz:QMon-569::DEBUG::2009-10-29 07:51:53,615::QemuMonitor::229::vds.vmlog.9f382326-7423-4941-a7c8-368dad53fb44::# VM is stopped due to disk write error: virtio0: No space left on device
vdsm.log.5.gz:QMon-611::DEBUG::2009-10-29 07:52:02,340::QemuMonitor::229::vds.vmlog.45937870-c797-4882-af3a-c043b11deadf::# VM is stopped due to disk write error: virtio0: No space left on device

Comment 9 Dor Laor 2009-11-05 04:00:59 UTC
Apararntly, seems like Uli's analysis is correct and we get timeout errors from NFS. Since currently we only translate them into no_space, this what happens for IDE. For virtio, from the above code, seems like write returns error while read is ignored. It's not good either.

My recommendation is to handle all errors as ENOSPAC for this release and let mgmt resume the guest (although it does not relate to lack of storage).
We need to make sure that read errors are also handled.

Comment 26 Miya Chen 2009-11-18 09:01:33 UTC
This bug can be reproduced in kvm-105.4.10.

step:
1. create a local nfs server in host, then
mount localhost:/root/test-nfs /mnt -o rw,soft,timeo=1,retrans=0
cd /mnt
qemu-img create test-virtio.qcow2 -f qcow2 10G

2. start guest:
/usr/libexec/qemu-kvm -rtc-td-hack -no-hpet -usbdevice tablet -drive file=rhel5u4-64-virtio.qcow2,if=virtio,boot=on -cpu qemu64,+sse2 -smp 2 -m 2G -net nic,macaddr=20:20:20:33:34:66,model=virtio,vlan=1 -net tap,script=/etc/qemu-ifup,vlan=1 -drive file=/mnt/test-virtio.qcow2,format=qcow2,cache=off,werror=stop,if=virtio -vnc :10 -name rhel54_64 -monitor stdio

3. in guest:
fdisk /dev/hdb
mkfs.ext3 /dev/hdb1
mount /dev/hdb1 /mnt
cp -r /usr/share /mnt

4. then in host:
gdb -p 31766
(gdb) break virtio_blk_rw_complete
(gdb) continue

5. in host:
service nfs stop

6. in host:
service nfs start
7.
(gdb) quit


Actual result:
guest dmesg:
end_request: I/O error, dev vdb, sector 5292199
Buffer I/O error on device vdb1, logical block 661517
lost page write due to I/O error on vdb1
Buffer I/O error on device vdb1, logical block 661518
lost page write due to I/O error on vdb1
Buffer I/O error on device vdb1, logical block 661519
lost page write due to I/O error on vdb1
Buffer I/O error on device vdb1, logical block 661520
lost page write due to I/O error on vdb1
Buffer I/O error on device vdb1, logical block 661521
lost page write due to I/O error on vdb1
Buffer I/O error on device vdb1, logical block 661522
lost page write due to I/O error on vdb1
Buffer I/O error on device vdb1, logical block 661523
lost page write due to I/O error on vdb1
Buffer I/O error on device vdb1, logical block 661524
lost page write due to I/O error on vdb1
Buffer I/O error on device vdb1, logical block 661525
lost page write due to I/O error on vdb1
Buffer I/O error on device vdb1, logical block 661526
lost page write due to I/O error on vdb1
Aborting journal on device vdb1.
ext3_abort called.
EXT3-fs error (device vdb1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
ext3_abort called.
EXT3-fs error (device vdb1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only

Comment 30 Miya Chen 2009-12-01 06:52:24 UTC
Test with build kvm-83-136 again, VM can stop on read errors.

steps:

1. create a local nfs server in host, then
mount localhost:/root/test-nfs /mnt -o rw,soft,timeo=1,retrans=0
cd /mnt
qemu-img create test-533390.qcow2 -f qcow2 10G


2. start guest:
/usr/libexec/qemu-kvm -rtc-td-hack -no-hpet -usbdevice tablet -cpu qemu64,+sse2
-drive
file=RHEL-Server-5.4-64-virtio.qcow2,if=virtio,boot=on,format=qcow2,cache=off,werror=stop
-smp 2 -m 2G -vnc :1 -net nic,macaddr=20:20:20:11:12:56,model=virtio,vlan=0
-net tap,script=/etc/qemu-ifup,vlan=0 -monitor stdio -drive
file=/mnt/test-533390.qcow2,if=virtio,format=qcow2,cache=off,werror=stop


3. in guest:
fdisk /dev/vdb
mkfs.ext3 /dev/vdb1
dd if=/dev/vdb1 of=/dev/null


4. in host:
service nfs stop

5. In host dmesg:
nfs: server localhost not responding, timed out


6. In qemu monitor
(qemu)info status
VM status: paused  

7. in host:
service nfs start

8. In qemu monitor
(qemu)c 

Guest works fine after restore.

Comment 35 errata-xmlrpc 2010-03-30 07:52:23 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0271.html

Comment 36 Red Hat Bugzilla 2023-09-14 01:18:33 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days


Note You need to log in before you can comment on or make changes to this bug.