Bug 479522 - kernel hangs with multiple BUG: soft lockup - CPU#0 stuck for 10s! [repquota:6377]
Summary: kernel hangs with multiple BUG: soft lockup - CPU#0 stuck for 10s! [repquota:...
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2009-01-10 17:15 UTC by Milan Kerslager
Modified: 2009-08-27 15:16 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-04-07 19:36:42 UTC
Target Upstream Version:


Attachments
messages from the kernel with BUG outputs (66.21 KB, text/plain)
2009-01-11 09:08 UTC, Milan Kerslager
lspci (2.06 KB, text/plain)
2009-01-30 07:52 UTC, Milan Kerslager
lspci -nvvv (20.44 KB, text/plain)
2009-01-30 07:52 UTC, Milan Kerslager
dmesg output after the start (29.53 KB, text/plain)
2009-01-30 08:14 UTC, Milan Kerslager
kernel logs from /var/log/messages (271.44 KB, text/plain)
2009-06-05 22:08 UTC, Gustavo Homem

Description Milan Kerslager 2009-01-10 17:15:28 UTC
Today I hit several BUG reports and then a hard lockup, with this error message:

Jan 10 04:16:33 opteron kernel: BUG: soft lockup - CPU#0 stuck for 10s! [repquota:6377]
Jan 10 04:16:33 opteron kernel: CPU 0:
Jan 10 04:16:33 opteron kernel: Modules linked in: nfs lockd fscache nfs_acl sunrpc netloop netbk blktap blkbk ipt_MASQUERADE bridge hidp l2cap bluetooth ip_conntrack_netbios_ns ipt_REJECT ipt_LOG xt_limit ipt_recent xt_state xt_tcpudp iptable_filter iptable_nat ip_nat ip_conntrack nfnetlink iptable_mangle ip_tables x_tables dm_multipath raid456 xor video sbs i2c_ec backlight button battery asus_acpi ac lp 8250_pnp snd_hda_intel snd_hda_codec snd_via82xx gameport snd_ac97_codec ac97_bus snd_mpu401_uart snd_rawmidi snd_seq_dummy shpchp snd_seq_oss snd_seq_midi_event k8_edac 8250 snd_seq serial_core snd_seq_device snd_pcm_oss edac_mc snd_mixer_oss k8temp parport_pc parport snd_pcm i2c_nforce2 hwmon snd_timer i2c_core serio_raw tg3 snd soundcore snd_page_alloc sg pcspkr dm_snapshot dm_zero dm_mirror dm_mod sata_nv libata sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Jan 10 04:16:33 opteron kernel: Pid: 6377, comm: repquota Not tainted 2.6.18-92.1.22.el5xen #1
Jan 10 04:16:33 opteron kernel: RIP: e030:[<ffffffff8025383b>]  [<ffffffff8025383b>] vfs_quota_sync+0x8b/0x15a
Jan 10 04:16:33 opteron kernel: RSP: e02b:ffff8803032d1e38  EFLAGS: 00000202
Jan 10 04:16:33 opteron kernel: RAX: 0000000000000031 RBX: ffff8803e7bba000 RCX: ffff88035b3eb6b0
Jan 10 04:16:33 opteron kernel: RDX: ffffffff80264868 RSI: 0000000000000002 RDI: ffffffff8050b864
Jan 10 04:16:33 opteron kernel: RBP: ffff88035b3eb680 R08: ffff8803032d1e58 R09: 0000000000000000
Jan 10 04:16:33 opteron kernel: R10: ffff880327a06480 R11: 0000000000800001 R12: ffff8803e7bba188
Jan 10 04:16:33 opteron kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffff8803e7bba120
Jan 10 04:16:33 opteron kernel: FS:  00002aaf90ef7250(0000) GS:ffffffff805b0000(0000) knlGS:0000000000000000
Jan 10 04:16:33 opteron kernel: CS:  e033 DS: 0000 ES: 0000
Jan 10 04:16:33 opteron kernel: 
Jan 10 04:16:33 opteron kernel: Call Trace:
Jan 10 04:16:33 opteron kernel:  [<ffffffff80253869>] vfs_quota_sync+0xb9/0x15a
Jan 10 04:16:33 opteron kernel:  [<ffffffff802eec78>] quota_sync_sb+0x17/0xf0
Jan 10 04:16:33 opteron kernel:  [<ffffffff802ef330>] sys_quotactl+0x4c8/0x5fc
Jan 10 04:16:33 opteron kernel:  [<ffffffff802ae94a>] audit_syscall_entry+0x16e/0x1a1
Jan 10 04:16:33 opteron kernel:  [<ffffffff802602f9>] tracesys+0xab/0xb6
Jan 10 04:16:33 opteron kernel: 

I am running the 2.6.18-92.1.22.el5xen kernel (the process runs in a Xen environment).
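
When a log contains many of these reports, it can help to pull out the CPU, duration, process name, and PID from each watchdog line before reading the full traces. The following is a minimal sketch, assuming the message format shown above; the names LOCKUP_RE and find_soft_lockups are made up for this illustration and are not part of any kernel or Red Hat tool.

```python
import re

# Matches the watchdog line as it appears in the logs in this report, e.g.
# "BUG: soft lockup - CPU#0 stuck for 10s! [repquota:6377]"
# (hypothetical helper pattern, written for this illustration only)
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft-lockup report found in an iterable of log lines."""
    hits = []
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            hits.append({
                "cpu": int(match.group("cpu")),
                "seconds": int(match.group("secs")),
                "comm": match.group("comm"),
                "pid": int(match.group("pid")),
            })
    return hits


# Two lines copied from the description above; only the first is a watchdog line.
sample = [
    "Jan 10 04:16:33 opteron kernel: BUG: soft lockup - CPU#0 stuck for 10s! [repquota:6377]",
    "Jan 10 04:16:33 opteron kernel: CPU 0:",
]
for hit in find_soft_lockups(sample):
    print(hit)
```

Run against a full /var/log/messages file, the same function would show whether the lockups always involve repquota or other processes as well.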

Comment 1 Milan Kerslager 2009-01-11 09:08:46 UTC
Created attachment 328654
messages from the kernel with BUG outputs

Comment 2 Gustavo Homem 2009-01-11 12:17:51 UTC
This problem started after upgrading from 2.6.18-53.1.13.el5 to 2.6.18-92.1.22.el5, and it hangs the system completely.

Here are some more logs:

EXT3-fs error (device sda5): ext3_lookup: unlinked inode 29427759 in dir #55378055
Aborting journal on device sda5.
ext3_abort called.
EXT3-fs error (device sda5): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
EXT3-fs error (device sda5): ext3_lookup: unlinked inode 29427759 in dir #55378055
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
journal commit I/O error
journal commit I/O error
EXT3-fs error (device sda5): ext3_lookup: unlinked inode 29427759 in dir #55378055
journal commit I/O error
EXT3-fs error (device sda5): ext3_lookup: unlinked inode 29427759 in dir #55378055
BUG: soft lockup - CPU#0 stuck for 10s! [repquota:3806]


Jan 11 01:45:01 multibasevat503024368 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [repquota:8270]
Jan 11 01:45:01 multibasevat503024368 kernel:
Jan 11 01:45:01 multibasevat503024368 kernel: Pid: 8270, comm:             repquota
Jan 11 01:45:01 multibasevat503024368 kernel: EIP: 0060:[<c0609ba4>] CPU: 3
Jan 11 01:45:01 multibasevat503024368 kernel: EIP is at _spin_lock+0x3/0xf
Jan 11 01:45:01 multibasevat503024368 kernel:  EFLAGS: 00000246    Not tainted  (2.6.18-92.1.22.el5 #1)
Jan 11 01:45:01 multibasevat503024368 kernel: EAX: c068039c EBX: f40ca580 ECX: f798c200 EDX: 00000002
Jan 11 01:45:01 multibasevat503024368 kernel: ESI: f40ca580 EDI: f798c2e4 EBP: 00000000 DS: 007b ES: 007b
Jan 11 01:45:01 multibasevat503024368 kernel: CR0: 8005003b CR2: 0804fda0 CR3: 32918000 CR4: 000006d0
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c04995bd>] dqput+0x53/0x15d
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c049aaca>] vfs_quota_sync+0x9b/0x131
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c049c851>] quota_sync_sb+0x11/0xcc
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c0438859>] down_read+0x8/0x11
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c049cec0>] sys_quotactl+0x4c7/0x5f3
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c060acb3>] do_page_fault+0x20a/0x4b8
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c060ad1d>] do_page_fault+0x274/0x4b8
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c0404eff>] syscall_call+0x7/0xb
Jan 11 01:45:01 multibasevat503024368 kernel:  =======================
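
The x86_64 trace in the description and the i386 trace above can be compared frame by frame to see which functions they share. The following is a rough sketch, assuming the "[<address>] symbol+0xoff/0xlen" frame format seen in both logs; FRAME_RE, frame_symbols, and common_symbols are hypothetical names introduced here for illustration.

```python
import re

# A stack frame in these logs looks like
# "[<ffffffff80253869>] vfs_quota_sync+0xb9/0x15a".
# (hypothetical pattern for illustration; other kernels may format frames differently)
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def frame_symbols(lines):
    """Extract the function names from call-trace lines, in order of appearance."""
    symbols = []
    for line in lines:
        symbols.extend(m.group("sym") for m in FRAME_RE.finditer(line))
    return symbols


def common_symbols(trace_a, trace_b):
    """Return symbols present in both traces, in trace_a order, without duplicates."""
    in_b = set(frame_symbols(trace_b))
    seen = set()
    shared = []
    for sym in frame_symbols(trace_a):
        if sym in in_b and sym not in seen:
            seen.add(sym)
            shared.append(sym)
    return shared


# Frames copied from the two traces in this report.
xen_trace = [
    " [<ffffffff80253869>] vfs_quota_sync+0xb9/0x15a",
    " [<ffffffff802eec78>] quota_sync_sb+0x17/0xf0",
    " [<ffffffff802ef330>] sys_quotactl+0x4c8/0x5fc",
    " [<ffffffff802ae94a>] audit_syscall_entry+0x16e/0x1a1",
    " [<ffffffff802602f9>] tracesys+0xab/0xb6",
]
i386_trace = [
    " [<c04995bd>] dqput+0x53/0x15d",
    " [<c049aaca>] vfs_quota_sync+0x9b/0x131",
    " [<c049c851>] quota_sync_sb+0x11/0xcc",
    " [<c0438859>] down_read+0x8/0x11",
    " [<c049cec0>] sys_quotactl+0x4c7/0x5f3",
    " [<c060acb3>] do_page_fault+0x20a/0x4b8",
    " [<c0404eff>] syscall_call+0x7/0xb",
]
print(common_symbols(xen_trace, i386_trace))
```

Both traces pass through sys_quotactl, quota_sync_sb, and vfs_quota_sync, which suggests the two reporters are hitting the same quota sync path, although the thread itself does not confirm a shared root cause.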

Comment 3 Jiri Pirko 2009-01-28 16:42:16 UTC
Milan, Gustavo, do you experience this issue on xen host only or on bare hardware too? Does the problem appear with 2.6.18-92.1.18 kernel? Thanks.

Comment 4 Gustavo Homem 2009-01-28 19:25:27 UTC
Hello Jiri,

(In reply to comment #3)
> Milan, Gustavo, do you experience this issue on xen host only or on bare
> hardware too? 

I don't know about Xen. For us it happens on bare hardware.

> Does the problem appear with 2.6.18-92.1.18 kernel? Thanks.

We did not test on other kernels. 

My test results:

2.6.18-53.1.13.el5 -> rock solid for more than one year
2.6.18-92.1.22.el5 -> crashes after some minutes of uptime
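
With one known-good and one known-bad release, any kernel that sorts between them is a candidate for narrowing down where the regression appeared. The following is a simplified sketch of that ordering; release_key and candidates_between are hypothetical helpers, the comparison is plain numeric-tuple ordering rather than RPM's own version comparison, and the list of releases contains only those named in this report, not the full set of shipped kernels.

```python
import re


def release_key(release):
    """Turn '2.6.18-92.1.22.el5' into a sortable tuple of integers.

    Everything from '.el' onward (for example 'el5' or 'el5xen') is ignored.
    This is a simplification, not RPM's version comparison algorithm.
    """
    numeric_part = release.split(".el")[0]
    return tuple(int(n) for n in re.findall(r"\d+", numeric_part))


def candidates_between(good, bad, available):
    """Return the releases that sort strictly between good and bad, oldest first."""
    low, high = release_key(good), release_key(bad)
    between = [r for r in available if low < release_key(r) < high]
    return sorted(between, key=release_key)


# Only releases mentioned in this bug report (assumption: others exist but are not listed).
known = [
    "2.6.18-92.1.22.el5",
    "2.6.18-53.1.13.el5",
    "2.6.18-92.1.18.el5",
]
print(candidates_between("2.6.18-53.1.13.el5", "2.6.18-92.1.22.el5", known))
```

Of the releases named here, only 2.6.18-92.1.18.el5 falls between the good and bad kernels, which is the one requested in comment 3.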

Comment 5 Milan Kerslager 2009-01-28 19:44:38 UTC
I run kernel-xen on bare hardware. No virtual machine was running at that time (I use host systems as builder hosts only from time to time). This is the only oops I have seen in the past 8 months (since I got this machine). This is the main server for our school (700 users total, 150 online, Samba, roaming profiles, Web, Webmail, 2 CPUs, 4 cores total). I have had no further oops since rebooting with the same kernel.

Comment 6 Gustavo Homem 2009-01-28 19:56:05 UTC
On my side the problem is deterministic. With 2.6.18-92.1.22.el5 it crashes with the reported error messages after some minutes.

Comment 7 Jiri Pirko 2009-01-29 09:25:56 UTC
OK, could you please tell me what type of machine you are experiencing this issue on? It would also be helpful if you could try to trigger the issue with the 2.6.18-92.1.18 kernel. Thanks a lot.

Comment 8 Gustavo Homem 2009-01-29 12:03:01 UTC
(In reply to comment #7)
> Ok, could you please provide the type of machine you are experiencing this
> issue on?

Sure:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 10
cpu MHz         : 3200.495
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips        : 6403.75

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 10
cpu MHz         : 3200.495
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips        : 6400.26

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 10
cpu MHz         : 3200.495
cache size      : 2048 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips        : 6400.27

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 10
cpu MHz         : 3200.495
cache size      : 2048 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips        : 6400.31

             total       used       free     shared    buffers     cached
Mem:       3631836    3224712     407124          0     284308    2519580
-/+ buffers/cache:     420824    3211012
Swap:      2096472        424    2096048


The machine does a lot of I/O and has quotas activated for all users.

> It would be also helpful if you can try to trigger the issue with
> 2.6.18-92.1.18 kernel. Thanks a lot.

I really can't do it. It is a heavily used production machine on a remote data center.

Comment 9 Jiri Pirko 2009-01-29 12:29:21 UTC
(In reply to comment #8)
Well, I meant more like machine vendor and type, storage type, etc. Processor information is too general. I would like to reproduce this issue, so I need something to go on.
> 
> The machine does a lot of I/O and has quotas activated for all users.
> 
>  It would be also helpful if you can try to trigger the issue with
> > 2.6.18-92.1.18 kernel. Thanks a lot.
> 
> I really can't do it. It is a heavily used production machine on a remote data
> center.
Ok - I understand.

Comment 10 Milan Kerslager 2009-01-29 12:59:08 UTC
I have quota active for /home too (LVM on top of double RAID1). Even though nothing changed after the oops (just a reboot with the same kernel, no fsck, etc.), no further oops has occurred. There were four freshly connected 1TB Seagate 7200.11 drives (with buggy firmware), but they were not used for file I/O; only the RAID1 was synced before the oops, and then they lay inactive. Three of them died shortly afterwards, but with no impact on the running system.

Comment 11 Gustavo Homem 2009-01-29 17:15:34 UTC
(In reply to comment #9)
> (In reply to comment #8)
> Well I ment more like machine vendor and type, storage type etc. Processor
> information are too general. I would like to reproduce this issue so I need to
> catch on something. 
> > 

This is the summary from the datacenter interface:

Qty     Component
3 	Generic \ 1024 MB \ DDR2 400 ECC Reg
1 	Generic \ 1024 MB \ DDR2 400 ECC
2 	Seagate \ 500GB:SATA:7200rpm \ ST3500630AS
2 	Intel \ 3.2 GHz 800FSB \ P4 Xeon
1 	Dell \ PE SC1425 \ 800FSB Dual Xeon
1 	Unknown \ Onboard \ SATA

The storage is local and the SATA driver is ata_piix.

Comment 12 Milan Kerslager 2009-01-30 07:50:47 UTC
I have (dmidecode | grep 'Product Name'): ASUS KFN32-D SLI

Comment 13 Milan Kerslager 2009-01-30 07:52:07 UTC
Created attachment 330446
lspci

Comment 14 Milan Kerslager 2009-01-30 07:52:47 UTC
Created attachment 330447
lspci -nvvv

Comment 15 Milan Kerslager 2009-01-30 08:14:00 UTC
Created attachment 330448
dmesg output after the start

Comment 16 Gustavo Homem 2009-04-03 19:13:50 UTC
Hello,

Is this being worked on? We're stuck on an old kernel and there are security updates to do.

Thanks

Comment 17 Milan Kerslager 2009-04-04 08:31:29 UTC
I have the latest kernel and the bug has not appeared again.

Comment 18 Gustavo Homem 2009-04-04 16:24:26 UTC
Thanks for the comment, Milan.

Note: 

There's nothing in the latest changelog about this:

http://rhn.redhat.com/errata/RHSA-2009-0326.html

Comment 19 Linda Wang 2009-04-07 19:36:42 UTC
RHSA-2009:0326 would be the latest 5.3.z kernel.

Per comment #17, closing as WORKSFORME. If you run into this issue
again, please feel free to reopen this bug. Thanks.

Comment 20 Gustavo Homem 2009-06-05 22:07:59 UTC
Unfortunately, this bug is still present on the latest kernel: 2.6.18-128.1.10.el5.

I can reproduce it easily on one of my machines. If I fall back to 
2.6.18-53.1.13.el5 it runs stable for months.

Comment 21 Gustavo Homem 2009-06-05 22:08:48 UTC
Created attachment 346725
kernel logs from /var/log/messages

Comment 22 Gustavo Homem 2009-06-05 22:19:40 UTC
Is this problem related to this one?

https://bugzilla.redhat.com/show_bug.cgi?id=465845

Comment 23 Gustavo Homem 2009-08-27 15:16:15 UTC
After an fsck on the /home partition and an upgrade to the most recent kernel, the system seems to be stable. No idea why kernel 2.6.18-53.1.13.el5 would be stable without the fsck whereas all the others crashed.
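
Since an fsck preceded the return to stability, it may be worth checking logs for filesystem errors like the EXT3-fs lines quoted in comment 2 before suspecting the kernel alone. The following is a small sketch that counts such errors per device and reporting function, assuming the message format shown in that comment; EXT3_ERR_RE and ext3_errors_by_device are hypothetical names, and the thread does not state whether sda5 is the /home partition.

```python
import re
from collections import Counter

# Matches lines such as
# "EXT3-fs error (device sda5): ext3_lookup: unlinked inode 29427759 in dir #55378055"
# (hypothetical pattern, written for this illustration only)
EXT3_ERR_RE = re.compile(r"EXT3-fs error \(device (?P<dev>[^)]+)\): (?P<func>\w+):")


def ext3_errors_by_device(lines):
    """Count EXT3-fs error lines, keyed by (device, reporting function)."""
    counts = Counter()
    for line in lines:
        match = EXT3_ERR_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("func"))] += 1
    return counts


# Lines copied from comment 2 of this report.
sample = [
    "EXT3-fs error (device sda5): ext3_lookup: unlinked inode 29427759 in dir #55378055",
    "Aborting journal on device sda5.",
    "EXT3-fs error (device sda5): ext3_journal_start_sb: Detected aborted journal",
    "journal commit I/O error",
    "EXT3-fs error (device sda5): ext3_lookup: unlinked inode 29427759 in dir #55378055",
]
for (device, func), count in sorted(ext3_errors_by_device(sample).items()):
    print(device, func, count)
```

Repeated errors on one device are only a hint that a filesystem check might be needed; they do not by themselves explain the soft lockups.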

