Bug 684253

Summary: EXT4-fs error and kernel oops in VMs hosted by VMware ESXi
Product: [Fedora] Fedora
Reporter: Francis.Montagnac
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 14
CC: colyli, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-06-25 15:30:43 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Francis.Montagnac 2011-03-11 15:31:49 UTC
We have 33 VMs hosted by a VMware ESXi cluster. They ran fine for
around six months on Fedora-12.

They have been unstable since I upgraded them (using yum) to
Fedora-14.

Two other VMs installed from scratch have started to show the same
symptoms, so I don't think it is related to the upgrade process.

We have more than a hundred workstations and laptops installed the
same way with Fedora-14 that do not have this problem.

The symptoms may take a rather long time to appear, between one day
and a week.

Example 1: loop of kernel oops 

  The most serious one: the VM loops on a kernel oops and is no
  longer accessible.

  We have to force a reboot. A manual fsck is then sometimes needed.

  The oops most often shows calls to system_call_fastpath and
  ext4_file_write (or nfs3_decode_dirent), but not always.

Example 2: uptime and top segfault in libproc

    [601941.287198] uptime[4348]: segfault at 42410073 ip \
    00000035fbe0a001 sp 00007fff569236e0 error 6 \
    in libproc-3.2.8.so[35fbe00000+e000]

  rpm -V confirms that libproc is corrupted:

    rpm -Vf /lib64/libproc-3.2.8.so 
    prelink: /lib64/libproc-3.2.8.so: prelinked file was modified
    S.?......    /lib64/libproc-3.2.8.so

  After a reboot, the problem goes away.
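
  For readers unfamiliar with the rpm -V result string shown above, here is
  a minimal Python sketch that decodes it. The attribute order and meanings
  follow my reading of the rpm man page's verify section; the function name
  decode_rpm_verify is hypothetical and only for illustration.

```python
# Attribute order of the result string printed by `rpm -V`
# (as described in the rpm man page; treat as an assumption for other rpm versions).
RPM_VERIFY_FLAGS = [
    ("S", "file size differs"),
    ("M", "mode (permissions or file type) differs"),
    ("5", "digest (checksum) differs"),
    ("D", "device major/minor number mismatch"),
    ("L", "readlink(2) path mismatch"),
    ("U", "user ownership differs"),
    ("G", "group ownership differs"),
    ("T", "mtime differs"),
    ("P", "capabilities differ"),
]


def decode_rpm_verify(flags):
    """Translate a result string such as 'S.?......' into readable findings."""
    findings = []
    for char, (letter, meaning) in zip(flags, RPM_VERIFY_FLAGS):
        if char == ".":
            continue  # this test passed
        if char == "?":
            findings.append(f"{letter}: test could not be performed")
        else:
            findings.append(f"{letter}: {meaning}")
    return findings


for finding in decode_rpm_verify("S.?......"):
    print(finding)
```

  Applied to the string in this report, the sketch reports a size mismatch
  and a digest test that could not be performed.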

Example 3: /var/log/messages showing EXT4-fs error

  For example:

    EXT4-fs error (device sda2): ext4_lookup: inode #923158: \
      (comm find) deleted inode referenced: 923185

    EXT4-fs error (device sda2): ext4_ext_check_inode: inode #209183: \
      (comm find) bad header/extent: invalid magic - magic 0, entries 0, \
      max 0(0), depth 0(0)

  We walk the filesystem with find every night.
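
  To compare VMs, messages like the ones above can be tallied per device and
  per reporting kernel function. The following is a rough Python sketch, not
  a tool we actually use: the regular expression is inferred only from the
  two messages quoted here, and the names parse_ext_errors and summarize are
  hypothetical.

```python
import re
from collections import Counter

# Matches single-line kernel messages of the form quoted in this report, e.g.
#   EXT4-fs error (device sda2): ext4_lookup: inode #923158: (comm find) ...
# The pattern is an assumption derived from those examples only.
EXT_ERROR = re.compile(
    r"EXT(?P<fs>[34])-fs error \(device (?P<device>[^)]+)\): "
    r"(?P<function>\w+): (?P<message>.*)"
)


def parse_ext_errors(lines):
    """Return one dict (fs, device, function, message) per matching line."""
    return [m.groupdict() for m in map(EXT_ERROR.search, lines) if m]


def summarize(lines):
    """Count errors per (device, reporting function) pair."""
    return Counter((e["device"], e["function"]) for e in parse_ext_errors(lines))
```

  Passing an open log file (for instance /var/log/messages) to summarize()
  gives a counter that can be compared across VMs, or across the EXT3 and
  EXT4 halves of a test.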

Any advice on how to investigate this further is welcome.

I plan to reconfigure half of those VMs to use EXT3 instead of
EXT4. Do you think that would be a worthwhile test?

Thanks.

Comment 1 colyli 2011-03-17 16:22:14 UTC
I have observed Example 3 on one of my machines too.

In my environment, the file with that inode number is a broken directory, which has been deleted but still appears in its parent directory.

Comment 2 Francis.Montagnac 2011-06-25 15:30:43 UTC
> I plan to reconfigure half of those VMs to use EXT3 instead of EXT4.

I did that and saw an "EXT3-fs error in htree_dirblock_to_tree:
bad entry in directory" once on one VM, so the problem was not
specific to EXT4.

I rebooted all of them at the beginning of May after a full
"yum update", which included the 2.6.35.12-90.fc14 kernel, and the
problem seems to be solved.

You can close this bug.