Bug 835019

Summary: trying to mount an empty 1K partition causes a hang in ext4 driver, using 100% CPU
Product: Fedora
Reporter: Richard W.M. Jones <rjones>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: rawhide
CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, mbooth, rjones, sdake, virt-maint, walkerrichardj
Keywords: Reopened
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Clone Of: 834896
Duplicates: 835084
Last Closed: 2012-08-28 11:05:13 UTC
Type: Bug
Bug Blocks: 834896
Attachments:
Description Flags
test1.img.xz none

Description Richard W.M. Jones 2012-06-25 09:18:42 UTC
+++ This bug was initially created as a clone of Bug #834896 +++

Mounting an extended partition directly used to give an
error (which seems like the correct thing to do, since an
extended partition is never a filesystem, and so cannot be
mounted).

However, with recent kernels the mount goes into an infinite
loop using 100% of CPU.

Here is a simple reproducer using libguestfs.

guestfish -x <<EOF
  sparse test1.img 100M
  run
  part-init /dev/sda mbr
  part-add /dev/sda p 32 127
  part-add /dev/sda e 128 -32
  part-add /dev/sda l 140 499
  part-add /dev/sda l 501 -64
  part-list /dev/sda
  mount /dev/sda2 /
EOF

It hangs at the last (mount) line where it's trying to mount
the extended partition.
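
The negative end sectors in the part-add lines count back from
the end of the disk. As a rough worked example, assuming 512-byte
sectors, that 100M means 100 MiB (204800 sectors), and that -1
denotes the last sector, the layout resolves as sketched below.
The helper name abs_sector is made up for this illustration and
is not part of guestfish.

```shell
# Hypothetical helper (not part of guestfish): resolve a sector number
# that may be negative, i.e. counted back from the end of the disk.
# Assumes -1 means the last sector, so the result is TOTAL + N.
abs_sector() {
    total=$1
    n=$2
    if [ "$n" -lt 0 ]; then
        echo $((total + n))
    else
        echo "$n"
    fi
}

# Assumption: 100M = 100 MiB with 512-byte sectors.
total=$((100 * 1024 * 1024 / 512))

echo "primary  (sda1): 32-$(abs_sector "$total" 127)"
echo "extended (sda2): 128-$(abs_sector "$total" -32)"
echo "logical        : 140-$(abs_sector "$total" 499)"
echo "logical        : 501-$(abs_sector "$total" -64)"
```

Under those assumptions both logical partitions fall inside the
extended partition, and /dev/sda2 itself holds no filesystem.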

You can get additional debug information by adding the '-v'
flag to the guestfish command line.

guestfsd is just executing this command:

  mount -o "" /dev/vda2 /sysroot/

So it appears to be a kernel bug.

Affected systems:

Distro      Kernel                          Affected?

Fedora 16   3.1.0-7.fc16.x86_64             No
Fedora 16   3.4.2-1.fc16.x86_64             Yes
Fedora 17   3.4.0-1.fc17.x86_64             Yes
Rawhide     3.5.0-0.rc2.git0.1.fc18.x86_64  Yes
Rawhide     3.5.0-0.rc3.git0.2.fc18.x86_64  Yes
RHEL 6      2.6.32-221.el6.x86_64           No

So the bug appears to have been introduced into the kernel
between 3.1.0 and 3.4.0, the earliest affected version in
the table (unfortunately rather a large range of versions!)
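
To narrow this range, intermediate kernels can be tested. A
minimal sketch, assuming GNU sort with -V is available and
taking 3.1.0 (last known good) and 3.4.0 (earliest known bad)
from the table above. The function names are hypothetical, and
distro builds may carry extra patches, so upstream version
numbers are only a rough guide.

```shell
# version_lt A B: succeed if version A sorts strictly before version B.
# Relies on "sort -V" (GNU coreutils version sort).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Bounds taken from the "Affected systems" table.
GOOD=3.1.0
BAD=3.4.0

# worth_testing V: succeed if V lies strictly between the last
# known-good and the earliest known-bad kernel version.
worth_testing() {
    version_lt "$GOOD" "$1" && version_lt "$1" "$BAD"
}

for v in 3.1.0 3.2.0 3.3.0 3.4.0; do
    if worth_testing "$v"; then
        echo "$v: inside the suspect range, worth testing"
    else
        echo "$v: already classified by the table"
    fi
done
```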

Comment 1 Richard W.M. Jones 2012-06-25 09:32:27 UTC
Created attachment 594138 [details]
test1.img.xz

Here is a way to reproduce this without libguestfs, using
a virtual machine.

Take the attached disk image and uncompress it.

Then add it as an extra disk to a virtual machine.

Boot the virtual machine, and inside run the following
command (assumes that you added the disk image as /dev/vdb):

  mkdir /tmp/mnt
  mount -o '' /dev/vdb2 /tmp/mnt

The mount command will spin in a loop using 100% of CPU,
apparently forever (or at least for many minutes).

The mount command is also unkillable, even with kill -9.

Comment 2 Richard W.M. Jones 2012-06-25 09:57:38 UTC
Possibly this bug?
http://www.spinics.net/lists/linux-ext4/msg32567.html

Comment 3 Richard W.M. Jones 2012-06-25 11:25:41 UTC
Stack trace from 'mount' command (captured using sysrq + t):

[    8.073005] mount           R  running task        0   134    133 0x00000000
[    8.073005]  ffff88001d6e3aa8 0000000000000082 ffff88001d768000 ffff88001d6e3fd8
[    8.073005]  ffff88001d6e3fd8 ffff88001d6e3fd8 ffff88001d769700 ffff88001d768000
[    8.073005]  0000000000000000 ffff88001d6e2000 0000000000000000 ffff88001dc26c60
[    8.073005] Call Trace:
[    8.073005]  [<ffffffff8108671a>] __cond_resched+0x2a/0x40
[    8.073005]  [<ffffffff815ef820>] _cond_resched+0x30/0x40
[    8.073005]  [<ffffffff8111d2eb>] find_lock_page+0x3b/0x80
[    8.073005]  [<ffffffff8111d9df>] find_or_create_page+0x3f/0xb0
[    8.073005]  [<ffffffff811acf12>] __getblk+0xf2/0x2a0
[    8.073005]  [<ffffffff811ad113>] __bread+0x13/0xb0
[    8.073005]  [<ffffffff8121b4e7>] ext4_fill_super+0x207/0x2a50
[    8.073005]  [<ffffffff8118055b>] mount_bdev+0x1cb/0x210
[    8.073005]  [<ffffffff8121b2e0>] ? ext4_remount+0x5d0/0x5d0
[    8.073005]  [<ffffffff8116b611>] ? __kmalloc_track_caller+0x51/0x180
[    8.073005]  [<ffffffff8120a7f5>] ext4_mount+0x15/0x20
[    8.073005]  [<ffffffff81181063>] mount_fs+0x43/0x1b0
[    8.073005]  [<ffffffff8113de80>] ? __alloc_percpu+0x10/0x20
[    8.073005]  [<ffffffff81199bc7>] vfs_kern_mount+0x67/0xf0
[    8.073005]  [<ffffffff8119a6e4>] do_kern_mount+0x54/0x110
[    8.073005]  [<ffffffff8119bf4a>] do_mount+0x26a/0x840
[    8.073005]  [<ffffffff8113832b>] ? strndup_user+0x5b/0x80
[    8.073005]  [<ffffffff8119c65d>] sys_mount+0x8d/0xe0
[    8.073005]  [<ffffffff815f8ae9>] system_call_fastpath+0x16/0x1b
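
A trace like this can be captured by writing t to
/proc/sysrq-trigger, which makes the kernel log the state of
every task. The sketch below filters such a log down to one
task. It assumes the log layout shown above (one space after the
timestamp on a task header line, two on the stack lines); newer
kernels format the header differently. The function names are
hypothetical.

```shell
# dump_task_states: ask the kernel to log every task state and stack,
# then print the kernel log. Needs root and sysrq enabled.
# Defined here but not called automatically.
dump_task_states() {
    echo t > /proc/sysrq-trigger
    dmesg
}

# extract_task_trace NAME: read kernel log text on stdin and print only
# the block belonging to task NAME (header, stack words, call trace).
extract_task_trace() {
    awk -v name="$1" '
    {
        line = $0
        sub(/^\[[ 0-9.]*\] /, "", line)   # drop the "[ timestamp] " prefix
        # Stack words and frames keep a leading space; "Call Trace:" is
        # part of the current task. Anything else starts a new task.
        if (line !~ /^ / && line !~ /^Call Trace:/) {
            split(line, f, " ")
            printing = (f[1] == name)
        }
        if (printing) print
    }'
}

# Usage (as root): dump_task_states | extract_task_trace mount
```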

Comment 4 Richard W.M. Jones 2012-06-25 11:31:00 UTC
Here's an even simpler way to reproduce the bug.  Simply
create a 1024 byte device (empty) and try to mount it:

guestfish -x -v <<EOF
  sparse test1.img 1024
  run
  mount /dev/sda /
EOF

The stack trace from this one is substantially the same:

[    7.476010] mount           R  running task        0   109    108 0x00000000
[    7.476010]  ffff88001d783aa8 0000000000000082 ffff88001d6cc500 ffff88001d783fd8
[    7.476010]  ffff88001d783fd8 ffff88001d783fd8 ffff88001d430000 ffff88001d6cc500
[    7.476010]  ffffea0000722ddc ffff88001d782000 0000000000000000 ffff88001dc248a0
[    7.476010] Call Trace:
[    7.476010]  [<ffffffff8108671a>] __cond_resched+0x2a/0x40
[    7.476010]  [<ffffffff8111d2f2>] ? find_lock_page+0x42/0x80
[    7.476010]  [<ffffffff815ef820>] _cond_resched+0x30/0x40
[    7.476010]  [<ffffffff8111d2eb>] find_lock_page+0x3b/0x80
[    7.476010]  [<ffffffff8111d9df>] find_or_create_page+0x3f/0xb0
[    7.476010]  [<ffffffff811acf12>] __getblk+0xf2/0x2a0
[    7.476010]  [<ffffffff811ad113>] __bread+0x13/0xb0
[    7.476010]  [<ffffffff8121b4e7>] ext4_fill_super+0x207/0x2a50
[    7.476010]  [<ffffffff8118055b>] mount_bdev+0x1cb/0x210
[    7.476010]  [<ffffffff8121b2e0>] ? ext4_remount+0x5d0/0x5d0
[    7.476010]  [<ffffffff8116b611>] ? __kmalloc_track_caller+0x51/0x180
[    7.476010]  [<ffffffff8120a7f5>] ext4_mount+0x15/0x20
[    7.476010]  [<ffffffff81181063>] mount_fs+0x43/0x1b0
[    7.476010]  [<ffffffff8113de80>] ? __alloc_percpu+0x10/0x20
[    7.476010]  [<ffffffff81199bc7>] vfs_kern_mount+0x67/0xf0
[    7.476010]  [<ffffffff8119a6e4>] do_kern_mount+0x54/0x110
[    7.476010]  [<ffffffff8119bf4a>] do_mount+0x26a/0x840
[    7.476010]  [<ffffffff8113832b>] ? strndup_user+0x5b/0x80
[    7.476010]  [<ffffffff8119c65d>] sys_mount+0x8d/0xe0
[    7.476010]  [<ffffffff815f8ae9>] system_call_fastpath+0x16/0x1b
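
The same empty 1024-byte device could in principle be set up
without libguestfs by using a loop device. This is an untested
sketch: whether a loop device triggers the same hang is an
assumption, and the function names are hypothetical. Because the
hung mount cannot be killed, it should only be tried in a
disposable virtual machine.

```shell
# make_empty_image PATH SIZE_BYTES: create a sparse, all-zero file.
make_empty_image() {
    truncate -s "$2" "$1"
}

# try_mount_image PATH MOUNTPOINT: attach PATH to a free loop device
# and try to mount it. Needs root. On an affected kernel the mount may
# never return, in which case the cleanup below is never reached.
# "-t ext4" is used because the traces above show ext4_fill_super.
try_mount_image() {
    loopdev=$(losetup --find --show "$1") || return 1
    mount -t ext4 "$loopdev" "$2"
    status=$?
    losetup --detach "$loopdev"
    return "$status"
}

# Usage (as root, in a throwaway VM):
#   make_empty_image test1.img 1024
#   mkdir -p /tmp/mnt
#   try_mount_image test1.img /tmp/mnt
```

On an unaffected kernel the mount would be expected to fail
with an error rather than hang.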

Comment 5 Richard W.M. Jones 2012-06-25 19:38:08 UTC
Thanks to Jeff Moyer who suggested the following patch:

https://lkml.org/lkml/2012/6/25/306

which fixes this bug.

Comment 6 Josh Boyer 2012-06-25 20:42:45 UTC
*** Bug 835084 has been marked as a duplicate of this bug. ***

Comment 7 Josh Boyer 2012-06-26 15:38:58 UTC
Patch committed to Fedora git.  Will be in the next build.

Comment 8 Fedora Update System 2012-06-27 00:08:53 UTC
kernel-3.4.4-3.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/kernel-3.4.4-3.fc17

Comment 9 Fedora Update System 2012-06-27 00:11:25 UTC
kernel-3.4.4-3.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.4.4-3.fc16

Comment 10 Fedora Update System 2012-06-28 03:28:00 UTC
Package kernel-3.4.4-3.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.4.4-3.fc17'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-9988/kernel-3.4.4-3.fc17
then log in and leave karma (feedback).

Comment 11 Fedora Update System 2012-06-30 21:59:33 UTC
kernel-3.4.4-3.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 12 Fedora Update System 2012-07-05 23:50:32 UTC
kernel-3.4.4-4.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.4.4-4.fc16

Comment 13 Fedora Update System 2012-07-08 20:51:42 UTC
kernel-3.4.4-4.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 14 Richard W.M. Jones 2012-08-22 10:09:56 UTC
Reopening on the basis of this email:

https://lkml.org/lkml/2012/8/21/692
"[PATCH] block: replace __getblk_slow misfix by grow_dev_page fix"

I am now testing the alternate fix proposed there.

Comment 15 Richard W.M. Jones 2012-08-28 11:05:13 UTC
The first patch (comment 14) caused a regression.

A second version of the patch went upstream and is already
included in kernel-3.6.0-0.rc3.git2.1.fc18.x86_64.rpm.
I wasn't able to test this until now.  However I have
just tested it, and the regression has gone.  Therefore
I am closing this bug again.