Bug 684253 - EXT4-fs error and kernel oops in VMs hosted by VMware ESXi
Summary: EXT4-fs error and kernel oops in VMs hosted by VMware ESXi
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 14
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-03-11 15:31 UTC by Francis.Montagnac
Modified: 2011-06-25 15:30 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-06-25 15:30:43 UTC
Type: ---



Description Francis.Montagnac 2011-03-11 15:31:49 UTC
We have 33 VMs hosted by a VMware ESXi cluster. They ran fine for
around six months under Fedora-12.

They became unstable after I upgraded them (using yum) to
Fedora-14.

Two other VMs installed from scratch have started to show the same
symptoms, so I don't think the problem is related to the upgrade process.

We have more than a hundred workstations and laptops installed the same
way with Fedora-14 that do not have this problem.

The symptoms may take a rather long time to appear, anywhere between
one day and a week.

Example 1: loop of kernel oops 

  The most serious one: the VM loops on kernel oopses and is no longer
  accessible.

  We have to force reboot it. A manual fsck is then sometimes needed.

  The oops most often shows calls to system_call_fastpath and
  ext4_file_write (or nfs3_decode_dirent), but not always.

Example 2: uptime and top segfault in libproc

    [601941.287198] uptime[4348]: segfault at 42410073 ip \
    00000035fbe0a001 sp 00007fff569236e0 error 6 \
    in libproc-3.2.8.so[35fbe00000+e000]
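  The "error 6" field in the segfault line above is the x86 page-fault
  error code, printed by the kernel in hexadecimal. The minimal Python
  sketch below decodes it, assuming the standard x86 bit layout (bit 0
  protection violation vs. page not present, bit 1 write, bit 2 user mode,
  bit 3 reserved bit, bit 4 instruction fetch). The regex and the helper
  name decode_pf_error are hypothetical, not part of any existing tool.

```python
from __future__ import annotations

import re

# x86 page-fault error-code bits: (meaning if set, meaning if clear).
# None means the clear state is not worth reporting.
PF_BITS = {
    0: ("protection violation", "page not present"),
    1: ("write access", "read access"),
    2: ("user mode", "kernel mode"),
    3: ("reserved bit set", None),
    4: ("instruction fetch", None),
}

# Hypothetical pattern for the kernel's "segfault at ..." message;
# address, ip, sp and the error code are all printed in hex.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_pf_error(code: int) -> list[str]:
    """Translate a page-fault error code into readable flags."""
    flags = []
    for bit, (if_set, if_clear) in PF_BITS.items():
        if code & (1 << bit):
            flags.append(if_set)
        elif if_clear is not None:
            flags.append(if_clear)
    return flags


line = ("uptime[4348]: segfault at 42410073 ip 00000035fbe0a001 "
        "sp 00007fff569236e0 error 6 in libproc-3.2.8.so[35fbe00000+e000]")
m = SEGFAULT_RE.search(line)
print(m.group("addr"), decode_pf_error(int(m.group("err"), 16)))
# → 42410073 ['page not present', 'write access', 'user mode']
```

  Read this way, the uptime crash was a user-mode write to an address
  that was not mapped.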

  rpm -V confirms that libproc is corrupted:

    rpm -Vf /lib64/libproc-3.2.8.so 
    prelink: /lib64/libproc-3.2.8.so: prelinked file was modified
    S.?......    /lib64/libproc-3.2.8.so
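  The "S.?......" string is rpm's per-file verification result. Assuming
  the nine-column order documented in rpm(8) (S size, M mode, 5 digest,
  D device, L link target, U user, G group, T mtime, P capabilities, with
  "." for a passed test and "?" for a test that could not be performed),
  a small Python sketch can spell it out. decode_verify is a hypothetical
  helper, not an rpm API.

```python
# Columns of the `rpm -V` attribute string, in the order given by rpm(8):
# the column letter means the attribute differs, "." means it matches,
# "?" means the check could not be performed.
VERIFY_COLUMNS = [
    ("S", "size"),
    ("M", "mode"),
    ("5", "digest"),
    ("D", "device"),
    ("L", "link target"),
    ("U", "user"),
    ("G", "group"),
    ("T", "mtime"),
    ("P", "capabilities"),
]


def decode_verify(attrs: str) -> dict:
    """Map each verified attribute to 'ok', 'differs' or 'not checked'."""
    if len(attrs) != len(VERIFY_COLUMNS):
        raise ValueError(f"expected {len(VERIFY_COLUMNS)} columns, got {len(attrs)}")
    result = {}
    for ch, (letter, name) in zip(attrs, VERIFY_COLUMNS):
        if ch == ".":
            result[name] = "ok"
        elif ch == "?":
            result[name] = "not checked"
        elif ch == letter:
            result[name] = "differs"
        else:
            raise ValueError(f"unexpected {ch!r} in column {letter}")
    return result


line = "S.?......    /lib64/libproc-3.2.8.so"
attrs, path = line.split(None, 1)  # attribute string, then the file path
problems = {k: v for k, v in decode_verify(attrs).items() if v != "ok"}
print(path, problems)
# → /lib64/libproc-3.2.8.so {'size': 'differs', 'digest': 'not checked'}
```

  So the file size changed and the digest could not be verified, which
  fits the prelink message printed just before it.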

  After a reboot, this is resolved.

Example 3: /var/log/messages showing EXT4-fs error

  For example:

    EXT4-fs error (device sda2): ext4_lookup: inode #923158: \
      (comm find) deleted inode referenced: 923185

    EXT4-fs error (device sda2): ext4_ext_check_inode: inode #209183: \
      (comm find) bad header/extent: invalid magic - magic 0, entries 0, \
      max 0(0), depth 0(0)

  We walk the filesystem with find every night.
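  To check whether the same inodes keep being reported after each nightly
  find run, the error lines can be pulled apart. This is a minimal Python
  sketch whose regex is assumed from the two messages quoted above only;
  ext4 errors that do not name an inode and a "comm" will not match.
  parse_ext4_errors is a hypothetical helper.

```python
import re

# Pattern assumed from the two messages quoted in this report.
EXT4_ERR_RE = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)]+)\): (?P<func>\w+): "
    r"inode #(?P<inode>\d+): \(comm (?P<comm>[^)]+)\) (?P<msg>.*)"
)


def parse_ext4_errors(lines):
    """Yield one dict per line that looks like an EXT4-fs inode error."""
    for line in lines:
        m = EXT4_ERR_RE.search(line)  # search() tolerates a syslog prefix
        if m:
            err = m.groupdict()
            err["inode"] = int(err["inode"])
            yield err


sample = [
    "EXT4-fs error (device sda2): ext4_lookup: inode #923158: "
    "(comm find) deleted inode referenced: 923185",
    "EXT4-fs error (device sda2): ext4_ext_check_inode: inode #209183: "
    "(comm find) bad header/extent: invalid magic - magic 0, entries 0, "
    "max 0(0), depth 0(0)",
    "some unrelated syslog line",
]
for err in parse_ext4_errors(sample):
    print(err["dev"], err["func"], err["inode"], err["comm"])
```

  On a live VM the same generator can be fed the lines of
  /var/log/messages directly (an open file object iterates by line).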

Any advice on how to investigate this further is welcome.

I plan to reconfigure half of those VMs to use EXT3 instead of
EXT4. Do you think that would be a useful test?

Thanks.

Comment 1 colyli 2011-03-17 16:22:14 UTC
For Example 3, I have observed this on one of my machines too.

In my environment, the inode in question is a broken directory, which has been deleted but still appears in its parent directory.

Comment 2 Francis.Montagnac 2011-06-25 15:30:43 UTC
> I plan to reconfigure half of those VMs to use EXT3 instead of EXT4.

I did that and noticed an "EXT3-fs error in htree_dirblock_to_tree:
bad entry in directory" once on one VM, so the problem is not specific to EXT4.

I rebooted all of them at the beginning of May after a full
"yum update", including the 2.6.35.12-90.fc14 kernel, and the problem
seems to be solved.

You can close this bug.

