Description of problem:
An ext4 mount can trigger the "NMI Watchdog detected LOCKUP on CPU x" kernel panic. Here's the kernel output / call trace:

EXT4-fs (sdb1): VFS: Can't find ext4 filesystem
NMI Watchdog detected LOCKUP on CPU 6
CPU 6
Modules linked in: tun netconsole nfsd exportfs auth_rpcgss nfs fscache nfs_acl ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc ib_iser libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi ib_srp rds ib_sdp ib_ipoib ipoib_helper ipv6 xfrm_nalgo crypto_api rdma_ucm rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ib_sa ext3 jbd loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev i2c_i801 i5400_edac ib_mthca i2c_core edac_mc ib_mad pcspkr sg serio_raw ib_core shpchp e1000e dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci libata sd_mod scsi_mod ext4 jbd2 crc16 uhci_hcd ohci_hcd ehci_hcd
Pid: 4807, comm: mount Not tainted 2.6.18-194.8.1.el5 #1
RIP: 0010:[<ffffffff80154456>] [<ffffffff80154456>] __list_add+0x14/0x68
RSP: 0018:ffff8102133a7a38 EFLAGS: 00000046
RAX: ffff810107ace9c0 RBX: ffff810107ace9c0 RCX: ffffffff800e3700
RDX: ffff810107ace9c0 RSI: ffff810107ace9c0 RDI: ffff81022dc14880
RBP: ffff810107ace9c0 R08: ffff81022fc251c0 R09: ffff810107acc460
R10: 0000000000000000 R11: 0000000000000000 R12: ffff81022dc14880
R13: ffff810107ace9c0 R14: 0000000000000004 R15: ffff810107ae0540
FS: 00002b6b3e4545c0(0000) GS:ffff81022fca7bc0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000001a62a000 CR3: 000000021ec97000 CR4: 00000000000006e0
Process mount (pid: 4807, threadinfo ffff8102133a6000, task ffff81022855e860)
Stack: 0000000000000246 ffff81022dc14880 ffff81022fc251c0 ffffffff8005c05b
 000000d0133a7bc8 0000000000000246 00000000000000d0 ffff810107ae0540
 0000000000000000 ffff810219f57000 0000000000000001 ffffffff800dbc73
Call Trace:
 [<ffffffff8005c05b>] cache_alloc_refill+0x106/0x186
 [<ffffffff800dbc73>] kmem_cache_zalloc+0x6f/0x94
 [<ffffffff8806556f>] :ext4:ext4_fill_super+0xd5/0x20a5
 [<ffffffff8806549a>] :ext4:ext4_fill_super+0x0/0x20a5
 [<ffffffff8015332d>] snprintf+0x44/0x4c
 [<ffffffff800645ab>] __down_write_nested+0x12/0x92
 [<ffffffff8012c0f9>] selinux_sb_alloc_security+0x3e/0x82
 [<ffffffff800ecf51>] get_filesystem+0x12/0x3b
 [<ffffffff800e3720>] test_bdev_super+0x0/0xd
 [<ffffffff8806549a>] :ext4:ext4_fill_super+0x0/0x20a5
 [<ffffffff800e46df>] get_sb_bdev+0x10a/0x16c
 [<ffffffff8012ccfc>] selinux_sb_copy_data+0x1a1/0x1c5
 [<ffffffff800e407c>] vfs_kern_mount+0x93/0x11a
 [<ffffffff800e4145>] do_kern_mount+0x36/0x4d
 [<ffffffff800ee880>] do_mount+0x6a9/0x719
 [<ffffffff800090d2>] __handle_mm_fault+0x96f/0xfaa
 [<ffffffff8002c920>] mntput_no_expire+0x19/0x89
 [<ffffffff8000a726>] __link_path_walk+0xf1a/0xf5b
 [<ffffffff800220e0>] __up_read+0x19/0x7f
 [<ffffffff80066b88>] do_page_fault+0x4fe/0x874
 [<ffffffff8002c920>] mntput_no_expire+0x19/0x89
 [<ffffffff8000ea20>] link_path_walk+0xa6/0xb2
 [<ffffffff800cc1f1>] zone_statistics+0x3e/0x6d
 [<ffffffff8000f2aa>] __alloc_pages+0x78/0x308
 [<ffffffff8004c38f>] sys_mount+0x8a/0xcd
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Code: 74 18 48 c7 c7 74 bb 2b 80 31 c0 e8 ab df f3 ff 0f 0b 68 26
Kernel panic - not syncing: nmi watchdog
BUG: warning at kernel/panic.c:137/panic() (Not tainted)

Version-Release number of selected component (if applicable):
Bad:
* 2.6.18-194.8.1.el5 (RHEL5 Update 5)
* 2.6.18-194.3.1.el5 (")
* 2.6.18-194.el5 (")
Good:
* 2.6.18-164.15.1.el5 (latest RHEL5 Update 4 kernel)

How reproducible:
The problem is reproducible every time.

Steps to Reproduce:
1. Clear the partition table of a hard disk and create a small 1 GB partition.
2. Zero the partition: "dd if=/dev/zero of=/dev/sdb1 bs=1M"
3. Try to mount the partition twice(!) with "mount -t ext4 /dev/sdb1 /mnt"

Actual results:
The first mount fails as expected, but the second triggers the NMI watchdog.

Expected results:
Two failed mounts.
Additional info: Yes, I've triggered this in real life: an HPC cluster post-installation script did probe mounts to check whether there are already existing partitions on the 2nd disk. (It cost me quite some time to break it down to this simple test case because the original script did much more.)
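For probe scripts like the one described above, one way to avoid depending on a failing mount is to look for the ext superblock magic before calling mount at all. The following is only a minimal sketch: the helper name has_ext_magic is hypothetical, and it assumes the standard on-disk layout (superblock at byte 1024, 16-bit little-endian s_magic at offset 56 inside it, value 0xEF53). The magic is shared by ext2, ext3 and ext4, so a match does not say which of the three is present.

```c
#include <stdio.h>
#include <stdint.h>

#define EXT_SB_OFFSET    1024    /* superblock starts 1024 bytes into the device */
#define EXT_MAGIC_OFFSET 56      /* offset of s_magic inside the superblock */
#define EXT_MAGIC        0xEF53  /* shared by ext2, ext3 and ext4 */

/*
 * Hypothetical helper (not part of any existing tool):
 * returns 1 if the ext magic is present, 0 if it is not,
 * and -1 if the device or file cannot be opened or read.
 */
int has_ext_magic(const char *path)
{
    unsigned char raw[2];
    FILE *f = fopen(path, "rb");

    if (!f)
        return -1;
    if (fseek(f, EXT_SB_OFFSET + EXT_MAGIC_OFFSET, SEEK_SET) != 0 ||
        fread(raw, 1, sizeof(raw), f) != sizeof(raw)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* s_magic is stored little-endian on disk. */
    uint16_t magic = (uint16_t)(raw[0] | (raw[1] << 8));
    return magic == EXT_MAGIC;
}
```

On a partition zeroed as in step 2, has_ext_magic("/dev/sdb1") would return 0, so the script could skip the mount attempt entirely.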
Thanks, I can reproduce this on rhel5 but not rhel6 or upstream, I'll take a look at it. -Eric
Ok this is probably due to a stray kfree(&sbi->s_blockgroup_lock) in the error path of mount; upstream that matches an allocation, but in rhel5.5 it was a mistaken backport. Thanks to Johann @ lustre for pointing that out to me ....
I fixed this by backporting

commit 705895b61133ef43d106fe6a6bbdb2eec923867e
Author: Pekka Enberg <penberg.fi>
Date: Sun Feb 15 18:07:52 2009 -0500

    ext4: allocate ->s_blockgroup_lock separately

rather than by removing the extraneous kfree.

-Eric
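To illustrate why allocating the lock separately makes the free in the error path legitimate, here is a userspace sketch of the pattern. It is not the kernel code: struct bg_lock, struct sb_info, sb_info_alloc and sb_info_free are hypothetical stand-ins, and calloc/free stand in for the kernel allocator. The point is only that every free must be handed a pointer that an allocation actually returned.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's block group lock structure. */
struct bg_lock {
    int locks[4];
};

/*
 * Hypothetical stand-in for the superblock info after the backport:
 * the lock is a pointer to its own allocation, not an embedded member.
 */
struct sb_info {
    struct bg_lock *s_blockgroup_lock;
};

/* Allocate the info structure and its lock; returns NULL on failure. */
struct sb_info *sb_info_alloc(void)
{
    struct sb_info *sbi = calloc(1, sizeof(*sbi));

    if (!sbi)
        return NULL;
    sbi->s_blockgroup_lock = calloc(1, sizeof(*sbi->s_blockgroup_lock));
    if (!sbi->s_blockgroup_lock) {
        free(sbi);
        return NULL;
    }
    return sbi;
}

/*
 * Error/teardown path: each free() matches exactly one allocation above.
 * If the lock were an embedded member instead, freeing its address would
 * pass the allocator a pointer it never handed out, which is undefined
 * behaviour here and can corrupt allocator state.
 */
void sb_info_free(struct sb_info *sbi)
{
    if (!sbi)
        return;
    free(sbi->s_blockgroup_lock);
    free(sbi);
}
```

With the embedded layout, the only correct error path is the one without the extra free. With the pointer layout shown here, the extra free is required, which is presumably why the free was correct upstream but harmful in the partial backport.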
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
*** Bug 594446 has been marked as a duplicate of this bug. ***
in kernel-2.6.18-219.el5 You can download this test kernel from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Verified in kernel-2.6.18-219.el5. The mount failed every time, as expected.

With kernel-2.6.18-219.el5, the bug was hit on the second attempt to mount:

NMI Watchdog detected LOCKUP on CPU 1
Call Trace:
 [<ffffffff8005c6b4>] cache_alloc_refill+0xf1/0x186
 [<ffffffff800dc9e3>] kmem_cache_zalloc+0x6f/0x94
 [<ffffffff8851d56f>] :ext4:ext4_fill_super+0xd5/0x20a5
 [<ffffffff8851d49a>] :ext4:ext4_fill_super+0x0/0x20a5
 [<ffffffff80153cb1>] snprintf+0x44/0x4c
 [<ffffffff800655ab>] __down_write_nested+0x12/0x92
 [<ffffffff8012cb3a>] selinux_sb_alloc_security+0x3e/0x82
 [<ffffffff800ed9be>] get_filesystem+0x12/0x3b
 [<ffffffff800e4490>] test_bdev_super+0x0/0xd
 [<ffffffff8851d49a>] :ext4:ext4_fill_super+0x0/0x20a5
 [<ffffffff800e544f>] get_sb_bdev+0x10a/0x16c
 [<ffffffff8012d73d>] selinux_sb_copy_data+0x1a1/0x1c5
 [<ffffffff800e4dec>] vfs_kern_mount+0x93/0x11a
 [<ffffffff800e4eb5>] do_kern_mount+0x36/0x4d
 [<ffffffff800ef2ed>] do_mount+0x6a9/0x719
 [<ffffffff80009101>] __handle_mm_fault+0x96f/0xfaa
 [<ffffffff8002cd2c>] mntput_no_expire+0x19/0x89
 [<ffffffff8000a759>] __link_path_walk+0xf1e/0xf42
 [<ffffffff80022127>] __up_read+0x19/0x7f
 [<ffffffff80067b88>] do_page_fault+0x4fe/0x874
 [<ffffffff8002cd2c>] mntput_no_expire+0x19/0x89
 [<ffffffff8000ea75>] link_path_walk+0xa6/0xb2
 [<ffffffff800cd378>] zone_statistics+0x3e/0x6d
 [<ffffffff8000f2ff>] __alloc_pages+0x78/0x308
 [<ffffffff8004c9fd>] sys_mount+0x8a/0xcd
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0
Hi, still unfixed in 2.6.18-194.26.1.el5? I have a similar crash here: EXT4-fs (dm-1): Unrecognized mount option "uid=100" or missing value NMI Watchdog detected LOCKUP on CPU 1 CPU 1 Modules linked in: mptctl mptbase ipmi_watchdog ipmi_devintf ipmi_si ipmi_msghandler ipv6 xfrm_nalgo crypto_api ext4 jbd2 crc16 dm_mirror dm_round_robin dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sr_mod cdrom hpilo serio_raw pcspkr sg bnx2 dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache qla2xxx scsi_transport_fc ata_piix libata shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 3501, comm: cmaidad Not tainted 2.6.18-194.26.1.el5 #1 RIP: 0010:[<ffffffff801543fb>] [<ffffffff801543fb>] list_del+0xb/0x71 RSP: 0018:ffff810191ebfc38 EFLAGS: 00003082 RAX: ffff810105b4b9c0 RBX: ffff81019d04e4c0 RCX: 0000000000000000 RDX: ffff81019d04e4c0 RSI: ffff810105b4b9c0 RDI: ffff81019d04e4c0 RBP: ffff81019d04e4c0 R08: ffff81019ff11cc0 R09: ffff81019fffd460 R10: 0000000000000000 R11: 0000000000000000 R12: ffff81019ff11cc0 R13: ffff810105b4b9c0 R14: 0000000000000004 R15: ffff81019ffe1540 FS: 0000000000000000(0000) GS:ffff81019ff11840(0063) knlGS:00000000f7decac0 CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b CR2: 00002b78084260a0 CR3: 000000019183f000 CR4: 00000000000006e0 Process cmaidad (pid: 3501, threadinfo ffff810191ebe000, task ffff810191f91100) Stack: ffff81019ffe1540 ffffffff8005c130 000000d000003286 ffff81019ffe1540 0000000000003246 00000000000000d0 ffff81019e550000 00000000ff9b57f0 00000000ff9b589c ffffffff800dbbc3 00000000ff9b57f0 0000000000000000 Call Trace: [<ffffffff8005c130>] cache_alloc_refill+0xf1/0x186 [<ffffffff800dbbc3>] __kmalloc+0x95/0x9f [<ffffffff880bad04>] :cciss:cciss_ioctl+0x50a/0xc58 [<ffffffff8000cf57>] do_lookup+0x65/0x1e6 [<ffffffff8000d47a>] dput+0x2c/0x114 [<ffffffff80057e46>] kobject_get+0x12/0x17 [<ffffffff8005ab33>] 
exact_lock+0xc/0x14 [<ffffffff880bb47c>] :cciss:do_ioctl+0x2a/0x39 [<ffffffff880bb69f>] :cciss:cciss_compat_ioctl+0x214/0x249 [<ffffffff800e5952>] blkdev_open+0x0/0x4f [<ffffffff800e5975>] blkdev_open+0x23/0x4f [<ffffffff80146b1c>] compat_blkdev_ioctl+0x4c/0x5f [<ffffffff800fb0b8>] compat_sys_ioctl+0xc5/0x2b2 [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76 Code: 48 39 fa 74 1b 48 89 fe 31 c0 48 c7 c7 56 bb 2b 80 e8 0c e1 Kernel panic - not syncing: nmi watchdog <0>Rebooting in 60 seconds..BUG: warning at kernel/panic.c:113/panic() (Not tainted) Call Trace: <NMI> [<ffffffff80091c1f>] panic+0x146/0x1eb [<ffffffff8006bad1>] _show_stack+0xdb/0xea [<ffffffff8006bbc4>] show_registers+0xe4/0x100 [<ffffffff800652c5>] die_nmi+0x66/0xa3 [<ffffffff80065a0b>] nmi_watchdog_tick+0x157/0x1d3 [<ffffffff80065629>] default_do_nmi+0x81/0x225 [<ffffffff80065896>] do_nmi+0x43/0x61 [<ffffffff80064eef>] nmi+0x7f/0x88 [<ffffffff801543fb>] list_del+0xb/0x71 <<EOE>> [<ffffffff8005c130>] cache_alloc_refill+0xf1/0x186 [<ffffffff800dbbc3>] __kmalloc+0x95/0x9f [<ffffffff880bad04>] :cciss:cciss_ioctl+0x50a/0xc58 [<ffffffff8000cf57>] do_lookup+0x65/0x1e6 [<ffffffff8000d47a>] dput+0x2c/0x114 [<ffffffff80057e46>] kobject_get+0x12/0x17 [<ffffffff8005ab33>] exact_lock+0xc/0x14 [<ffffffff880bb47c>] :cciss:do_ioctl+0x2a/0x39 [<ffffffff880bb69f>] :cciss:cciss_compat_ioctl+0x214/0x249 [<ffffffff800e5952>] blkdev_open+0x0/0x4f [<ffffffff800e5975>] blkdev_open+0x23/0x4f [<ffffffff80146b1c>] compat_blkdev_ioctl+0x4c/0x5f [<ffffffff800fb0b8>] compat_sys_ioctl+0xc5/0x2b2 [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76 BUG: warning at drivers/input/serio/i8042.c:846/i8042_panic_blink() (Not tainted) Call Trace: <NMI> [<ffffffff8020b0df>] i8042_panic_blink+0x112/0x2a5 [<ffffffff80091bc5>] panic+0xec/0x1eb [<ffffffff8006bad1>] _show_stack+0xdb/0xea [<ffffffff8006bbc4>] show_registers+0xe4/0x100 [<ffffffff800652c5>] die_nmi+0x66/0xa3 [<ffffffff80065a0b>] nmi_watchdog_tick+0x157/0x1d3 [<ffffffff80065629>] 
default_do_nmi+0x81/0x225 [<ffffffff80065896>] do_nmi+0x43/0x61 [<ffffffff80064eef>] nmi+0x7f/0x88 [<ffffffff801543fb>] list_del+0xb/0x71 <<EOE>> [<ffffffff8005c130>] cache_alloc_refill+0xf1/0x186 [<ffffffff800dbbc3>] __kmalloc+0x95/0x9f [<ffffffff880bad04>] :cciss:cciss_ioctl+0x50a/0xc58 [<ffffffff8000cf57>] do_lookup+0x65/0x1e6 [<ffffffff8000d47a>] dput+0x2c/0x114 [<ffffffff80057e46>] kobject_get+0x12/0x17 [<ffffffff8005ab33>] exact_lock+0xc/0x14 [<ffffffff880bb47c>] :cciss:do_ioctl+0x2a/0x39 [<ffffffff880bb69f>] :cciss:cciss_compat_ioctl+0x214/0x249 [<ffffffff800e5952>] blkdev_open+0x0/0x4f [<ffffffff800e5975>] blkdev_open+0x23/0x4f [<ffffffff80146b1c>] compat_blkdev_ioctl+0x4c/0x5f [<ffffffff800fb0b8>] compat_sys_ioctl+0xc5/0x2b2 [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76 BUG: warning at drivers/input/serio/i8042.c:849/i8042_panic_blink() (Not tainted) Call Trace: <NMI> [<ffffffff8020b1c8>] i8042_panic_blink+0x1fb/0x2a5 [<ffffffff80091bc5>] panic+0xec/0x1eb [<ffffffff8006bad1>] _show_stack+0xdb/0xea [<ffffffff8006bbc4>] show_registers+0xe4/0x100 [<ffffffff800652c5>] die_nmi+0x66/0xa3 [<ffffffff80065a0b>] nmi_watchdog_tick+0x157/0x1d3 [<ffffffff80065629>] default_do_nmi+0x81/0x225 [<ffffffff80065896>] do_nmi+0x43/0x61 [<ffffffff80064eef>] nmi+0x7f/0x88 [<ffffffff801543fb>] list_del+0xb/0x71 <<EOE>> [<ffffffff8005c130>] cache_alloc_refill+0xf1/0x186 [<ffffffff800dbbc3>] __kmalloc+0x95/0x9f [<ffffffff880bad04>] :cciss:cciss_ioctl+0x50a/0xc58 [<ffffffff8000cf57>] do_lookup+0x65/0x1e6 [<ffffffff8000d47a>] dput+0x2c/0x114 [<ffffffff80057e46>] kobject_get+0x12/0x17 [<ffffffff8005ab33>] exact_lock+0xc/0x14 [<ffffffff880bb47c>] :cciss:do_ioctl+0x2a/0x39 [<ffffffff880bb69f>] :cciss:cciss_compat_ioctl+0x214/0x249 [<ffffffff800e5952>] blkdev_open+0x0/0x4f [<ffffffff800e5975>] blkdev_open+0x23/0x4f [<ffffffff80146b1c>] compat_blkdev_ioctl+0x4c/0x5f [<ffffffff800fb0b8>] compat_sys_ioctl+0xc5/0x2b2 [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76 BUG: warning at 
drivers/input/serio/i8042.c:851/i8042_panic_blink() (Not tainted) Call Trace: <NMI> [<ffffffff8020b245>] i8042_panic_blink+0x278/0x2a5 [<ffffffff80091bc5>] panic+0xec/0x1eb [<ffffffff8006bad1>] _show_stack+0xdb/0xea [<ffffffff8006bbc4>] show_registers+0xe4/0x100 [<ffffffff800652c5>] die_nmi+0x66/0xa3 [<ffffffff80065a0b>] nmi_watchdog_tick+0x157/0x1d3 Call Trace: <NMI> [<ffffffff8020b1c8>] i8042_panic_blink+0x1fb/0x2a5 [<ffffffff80091bc5>] panic+0xec/0x1eb [<ffffffff8006bad1>] _show_stack+0xdb/0xea [<ffffffff8006bbc4>] show_registers+0xe4/0x100 [<ffffffff800652c5>] die_nmi+0x66/0xa3 [<ffffffff80065a0b>] nmi_watchdog_tick+0x157/0x1d3 [<ffffffff80065629>] default_do_nmi+0x81/0x225 [<ffffffff80065896>] do_nmi+0x43/0x61 [<ffffffff80064eef>] nmi+0x7f/0x88 [<ffffffff801543fb>] list_del+0xb/0x71 <<EOE>> [<ffffffff8005c130>] cache_alloc_refill+0xf1/0x186 [<ffffffff800dbbc3>] __kmalloc+0x95/0x9f [<ffffffff880bad04>] :cciss:cciss_ioctl+0x50a/0xc58 [<ffffffff8000cf57>] do_lookup+0x65/0x1e6 [<ffffffff8000d47a>] dput+0x2c/0x114 [<ffffffff80057e46>] kobject_get+0x12/0x17 [<ffffffff8005ab33>] exact_lock+0xc/0x14 [<ffffffff880bb47c>] :cciss:do_ioctl+0x2a/0x39 [<ffffffff880bb69f>] :cciss:cciss_compat_ioctl+0x214/0x249 [<ffffffff800e5952>] blkdev_open+0x0/0x4f [<ffffffff800e5975>] blkdev_open+0x23/0x4f [<ffffffff80146b1c>] compat_blkdev_ioctl+0x4c/0x5f [<ffffffff800fb0b8>] compat_sys_ioctl+0xc5/0x2b2 [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76 BUG: warning at drivers/input/serio/i8042.c:851/i8042_panic_blink() (Not tainted) Call Trace: <NMI> [<ffffffff8020b245>] i8042_panic_blink+0x278/0x2a5 [<ffffffff80091bc5>] panic+0xec/0x1eb [<ffffffff8006bad1>] _show_stack+0xdb/0xea [<ffffffff8006bbc4>] show_registers+0xe4/0x100 [<ffffffff800652c5>] die_nmi+0x66/0xa3 [<ffffffff80065a0b>] nmi_watchdog_tick+0x157/0x1d3 [<ffffffff80065629>] default_do_nmi+0x81/0x225 [<ffffffff80065896>] do_nmi+0x43/0x61 [<ffffffff80064eef>] nmi+0x7f/0x88 [<ffffffff801543fb>] list_del+0xb/0x71 <<EOE>> 
[<ffffffff8005c130>] cache_alloc_refill+0xf1/0x186 [<ffffffff800dbbc3>] __kmalloc+0x95/0x9f [<ffffffff880bad04>] :cciss:cciss_ioctl+0x50a/0xc58 [<ffffffff8000cf57>] do_lookup+0x65/0x1e6 [<ffffffff8000d47a>] dput+0x2c/0x114 [<ffffffff80057e46>] kobject_get+0x12/0x17 [<ffffffff8005ab33>] exact_lock+0xc/0x14 [<ffffffff880bb47c>] :cciss:do_ioctl+0x2a/0x39 [<ffffffff880bb69f>] :cciss:cciss_compat_ioctl+0x214/0x249 [<ffffffff800e5952>] blkdev_open+0x0/0x4f [<ffffffff800e5975>] blkdev_open+0x23/0x4f [<ffffffff80146b1c>] compat_blkdev_ioctl+0x4c/0x5f [<ffffffff800fb0b8>] compat_sys_ioctl+0xc5/0x2b2 [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76
Bjorn, I do not see ext4 calls in your trace - why do you think this is the same issue? If not, please open an issue via your Red Hat support channel so we can get our field people to help gather information. Thanks!
At any rate, the bug is not fixed in the kernel you are testing. See comment #8.
Ric,

(In reply to comment #12)
> Bjorn, I do not see ext4 calls in your trace - why do you think this is the
> same issue?
>
> If not, please open an issue with via your redhat support channel so we can get
> our field people to help gather information. Thanks!

I searched Bugzilla and found Bug #594446, in which the OP used almost exactly the same command to crash the kernel. (I used "mount -o uid=foo,gid=foo -t ext4 /dev/mapper/foop1 /home/foo".) Eric replied to #594446 and marked it as a duplicate of this Bug #614957. So if this looks like a new bug, I'll open an issue via RH support.
Thanks!
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html
*** Bug 684048 has been marked as a duplicate of this bug. ***