Bug 479522 - kernel hangs with multiple BUG: soft lockup - CPU#0 stuck for 10s! [repquota:6377]
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: x86_64 Linux
Priority: low
Severity: high
Assigned To: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks:

Reported: 2009-01-10 12:15 EST by Milan Kerslager
Modified: 2009-08-27 11:16 EDT
Doc Type: Bug Fix
Last Closed: 2009-04-07 15:36:42 EDT
Attachments
messages from the kernel with BUG outputs (66.21 KB, text/plain)
2009-01-11 04:08 EST, Milan Kerslager
lspci (2.06 KB, text/plain)
2009-01-30 02:52 EST, Milan Kerslager
lspci -nvvv (20.44 KB, text/plain)
2009-01-30 02:52 EST, Milan Kerslager
dmesg output after the start (29.53 KB, text/plain)
2009-01-30 03:14 EST, Milan Kerslager
kernel logs from /var/log/messages (271.44 KB, text/plain)
2009-06-05 18:08 EDT, Gustavo Homem

Description Milan Kerslager 2009-01-10 12:15:28 EST
Today I hit several BUG reports, followed by a hard lockup, with this error message:

Jan 10 04:16:33 opteron kernel: BUG: soft lockup - CPU#0 stuck for 10s! [repquota:6377]
Jan 10 04:16:33 opteron kernel: CPU 0:
Jan 10 04:16:33 opteron kernel: Modules linked in: nfs lockd fscache nfs_acl sunrpc netloop netbk blktap blkbk ipt_MASQUERADE bridge hidp l2cap bluetooth ip_conntrack_netbios_ns ipt_REJECT ipt_LOG xt_limit ipt_recent xt_state xt_tcpudp iptable_filter iptable_nat ip_nat ip_conntrack nfnetlink iptable_mangle ip_tables x_tables dm_multipath raid456 xor video sbs i2c_ec backlight button battery asus_acpi ac lp 8250_pnp snd_hda_intel snd_hda_codec snd_via82xx gameport snd_ac97_codec ac97_bus snd_mpu401_uart snd_rawmidi snd_seq_dummy shpchp snd_seq_oss snd_seq_midi_event k8_edac 8250 snd_seq serial_core snd_seq_device snd_pcm_oss edac_mc snd_mixer_oss k8temp parport_pc parport snd_pcm i2c_nforce2 hwmon snd_timer i2c_core serio_raw tg3 snd soundcore snd_page_alloc sg pcspkr dm_snapshot dm_zero dm_mirror dm_mod sata_nv libata sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Jan 10 04:16:33 opteron kernel: Pid: 6377, comm: repquota Not tainted 2.6.18-92.1.22.el5xen #1
Jan 10 04:16:33 opteron kernel: RIP: e030:[<ffffffff8025383b>]  [<ffffffff8025383b>] vfs_quota_sync+0x8b/0x15a
Jan 10 04:16:33 opteron kernel: RSP: e02b:ffff8803032d1e38  EFLAGS: 00000202
Jan 10 04:16:33 opteron kernel: RAX: 0000000000000031 RBX: ffff8803e7bba000 RCX: ffff88035b3eb6b0
Jan 10 04:16:33 opteron kernel: RDX: ffffffff80264868 RSI: 0000000000000002 RDI: ffffffff8050b864
Jan 10 04:16:33 opteron kernel: RBP: ffff88035b3eb680 R08: ffff8803032d1e58 R09: 0000000000000000
Jan 10 04:16:33 opteron kernel: R10: ffff880327a06480 R11: 0000000000800001 R12: ffff8803e7bba188
Jan 10 04:16:33 opteron kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffff8803e7bba120
Jan 10 04:16:33 opteron kernel: FS:  00002aaf90ef7250(0000) GS:ffffffff805b0000(0000) knlGS:0000000000000000
Jan 10 04:16:33 opteron kernel: CS:  e033 DS: 0000 ES: 0000
Jan 10 04:16:33 opteron kernel: 
Jan 10 04:16:33 opteron kernel: Call Trace:
Jan 10 04:16:33 opteron kernel:  [<ffffffff80253869>] vfs_quota_sync+0xb9/0x15a
Jan 10 04:16:33 opteron kernel:  [<ffffffff802eec78>] quota_sync_sb+0x17/0xf0
Jan 10 04:16:33 opteron kernel:  [<ffffffff802ef330>] sys_quotactl+0x4c8/0x5fc
Jan 10 04:16:33 opteron kernel:  [<ffffffff802ae94a>] audit_syscall_entry+0x16e/0x1a1
Jan 10 04:16:33 opteron kernel:  [<ffffffff802602f9>] tracesys+0xab/0xb6
Jan 10 04:16:33 opteron kernel: 

I am running the 2.6.18-92.1.22.el5xen kernel (the process runs in a Xen environment).
Comment 1 Milan Kerslager 2009-01-11 04:08:46 EST
Created attachment 328654 [details]
messages from the kernel with BUG outputs
Comment 2 Gustavo Homem 2009-01-11 07:17:51 EST
This problem started after upgrading from 2.6.18-53.1.13.el5 to 2.6.18-92.1.22.el5, and it hangs the system completely.

Here are some more logs:

EXT3-fs error (device sda5): ext3_lookup: unlinked inode 29427759 in dir #55378055
Aborting journal on device sda5.
ext3_abort called.
EXT3-fs error (device sda5): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
EXT3-fs error (device sda5): ext3_lookup: unlinked inode 29427759 in dir #55378055
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
journal commit I/O error
journal commit I/O error
EXT3-fs error (device sda5): ext3_lookup: unlinked inode 29427759 in dir #55378055
journal commit I/O error
EXT3-fs error (device sda5): ext3_lookup: unlinked inode 29427759 in dir #55378055
BUG: soft lockup - CPU#0 stuck for 10s! [repquota:3806]


Jan 11 01:45:01 multibasevat503024368 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [repquota:8270]
Jan 11 01:45:01 multibasevat503024368 kernel:
Jan 11 01:45:01 multibasevat503024368 kernel: Pid: 8270, comm:             repquota
Jan 11 01:45:01 multibasevat503024368 kernel: EIP: 0060:[<c0609ba4>] CPU: 3
Jan 11 01:45:01 multibasevat503024368 kernel: EIP is at _spin_lock+0x3/0xf
Jan 11 01:45:01 multibasevat503024368 kernel:  EFLAGS: 00000246    Not tainted  (2.6.18-92.1.22.el5 #1)
Jan 11 01:45:01 multibasevat503024368 kernel: EAX: c068039c EBX: f40ca580 ECX: f798c200 EDX: 00000002
Jan 11 01:45:01 multibasevat503024368 kernel: ESI: f40ca580 EDI: f798c2e4 EBP: 00000000 DS: 007b ES: 007b
Jan 11 01:45:01 multibasevat503024368 kernel: CR0: 8005003b CR2: 0804fda0 CR3: 32918000 CR4: 000006d0
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c04995bd>] dqput+0x53/0x15d
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c049aaca>] vfs_quota_sync+0x9b/0x131
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c049c851>] quota_sync_sb+0x11/0xcc
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c0438859>] down_read+0x8/0x11
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c049cec0>] sys_quotactl+0x4c7/0x5f3
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c060acb3>] do_page_fault+0x20a/0x4b8
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c060ad1d>] do_page_fault+0x274/0x4b8
Jan 11 01:45:01 multibasevat503024368 kernel:  [<c0404eff>] syscall_call+0x7/0xb
Jan 11 01:45:01 multibasevat503024368 kernel:  =======================
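
Editorial note: to summarize events like the ones quoted above across a large /var/log/messages, a small script can pull out each soft lockup line. The following is only a minimal sketch and was not used in this report. The regular expression and the names `SoftLockup` and `parse_soft_lockups` are assumptions, derived solely from the message format visible in the logs above.

```python
import re
from typing import NamedTuple

# Pattern assumed from the log lines quoted in this report, e.g.
# "BUG: soft lockup - CPU#0 stuck for 10s! [repquota:6377]"
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


class SoftLockup(NamedTuple):
    cpu: int      # CPU number reported as stuck
    seconds: int  # how long the watchdog says it was stuck
    comm: str     # command name of the running process
    pid: int      # process ID


def parse_soft_lockups(lines):
    """Return one SoftLockup per matching line (hypothetical helper)."""
    events = []
    for line in lines:
        match = SOFT_LOCKUP_RE.search(line)
        if match:
            events.append(
                SoftLockup(
                    cpu=int(match.group("cpu")),
                    seconds=int(match.group("secs")),
                    comm=match.group("comm"),
                    pid=int(match.group("pid")),
                )
            )
    return events


if __name__ == "__main__":
    sample = [
        "Jan 10 04:16:33 opteron kernel: BUG: soft lockup - CPU#0 stuck for 10s! [repquota:6377]",
        "Jan 10 04:16:33 opteron kernel: CPU 0:",
        "BUG: soft lockup - CPU#3 stuck for 10s! [repquota:8270]",
    ]
    for event in parse_soft_lockups(sample):
        print(event)
```

Grouping the results by `comm` would show quickly whether every lockup involves repquota, as in both reports here.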
Comment 3 Jiri Pirko 2009-01-28 11:42:16 EST
Milan, Gustavo, do you experience this issue on xen host only or on bare hardware too? Does the problem appear with 2.6.18-92.1.18 kernel? Thanks.
Comment 4 Gustavo Homem 2009-01-28 14:25:27 EST
Hello Jiri,

(In reply to comment #3)
> Milan, Gustavo, do you experience this issue on xen host only or on bare
> hardware too? 

I don't know about Xen. For us it happens on bare hardware.

> Does the problem appear with 2.6.18-92.1.18 kernel? Thanks.

We did not test on other kernels. 

My test results so far:

2.6.18-53.1.13.el5 -> rock solid for more than one year
2.6.18-92.1.22.el5 -> crashes after some minutes of uptime
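
Editorial note: the request to try 2.6.18-92.1.18 amounts to narrowing the window between the known-good and known-bad builds. As a rough sketch (not part of the original report), the release strings can be ordered by their numeric fields. The helper `release_key` is hypothetical, and this simple comparison is an assumption that ignores the full RPM version-comparison rules.

```python
def release_key(kernel: str):
    """Turn '2.6.18-92.1.22.el5' into a tuple of ints for ordering.

    Hypothetical helper: it handles only the simple numeric pattern seen
    in this report and stops at the first non-numeric field (e.g. 'el5').
    """
    version, _, release = kernel.partition("-")
    fields = version.split(".") + release.split(".")
    key = []
    for field in fields:
        if not field.isdigit():
            break
        key.append(int(field))
    return tuple(key)


known_good = "2.6.18-53.1.13.el5"  # stable for more than a year
known_bad = "2.6.18-92.1.22.el5"   # crashes after some minutes
candidate = "2.6.18-92.1.18"       # the build suggested for testing

# True when the candidate lies strictly between the two tested builds.
in_window = release_key(known_good) < release_key(candidate) < release_key(known_bad)
print(in_window)
```

A candidate inside the window is worth testing because either outcome shrinks the range of builds in which the regression could have been introduced.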
Comment 5 Milan Kerslager 2009-01-28 14:44:38 EST
I run kernel-xen on bare hardware. No virtual machine was running at the time (I use host systems as build hosts only from time to time). This was the only oops I have seen in the 8 months I have had this machine. It is the main server for our school (700 users total, 150 online, Samba, roaming profiles, Web, Webmail, 2 CPUs, 4 cores total). I have had no further oops since rebooting with the same kernel.
Comment 6 Gustavo Homem 2009-01-28 14:56:05 EST
On my side the problem is deterministic. With 2.6.18-92.1.22.el5 it crashes with the reported error messages after some minutes.
Comment 7 Jiri Pirko 2009-01-29 04:25:56 EST
Ok, could you please provide the type of machine you are experiencing this issue on? It would be also helpful if you can try to trigger the issue with 2.6.18-92.1.18 kernel. Thanks a lot.
Comment 8 Gustavo Homem 2009-01-29 07:03:01 EST
(In reply to comment #7)
> Ok, could you please provide the type of machine you are experiencing this
> issue on?

Sure:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 10
cpu MHz         : 3200.495
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips        : 6403.75

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 10
cpu MHz         : 3200.495
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips        : 6400.26

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 10
cpu MHz         : 3200.495
cache size      : 2048 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips        : 6400.27

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 10
cpu MHz         : 3200.495
cache size      : 2048 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips        : 6400.31

             total       used       free     shared    buffers     cached
Mem:       3631836    3224712     407124          0     284308    2519580
-/+ buffers/cache:     420824    3211012
Swap:      2096472        424    2096048


The machine does a lot of I/O and has quotas activated for all users.

> It would be also helpful if you can try to trigger the issue with
> 2.6.18-92.1.18 kernel. Thanks a lot.

I really can't do it. It is a heavily used production machine in a remote data center.
Comment 9 Jiri Pirko 2009-01-29 07:29:21 EST
(In reply to comment #8)
Well, I meant more like machine vendor and type, storage type, etc. Processor information is too general. I would like to reproduce this issue, so I need something to go on.
> 
> The machine does a lot of I/O and has quotas activated for all users.
> 
>  It would be also helpful if you can try to trigger the issue with
> > 2.6.18-92.1.18 kernel. Thanks a lot.
> 
> I really can't do it. It is a heavily used production machine on a remote data
> center.
Ok - I understand.
Comment 10 Milan Kerslager 2009-01-29 07:59:08 EST
I have quota active for /home too (LVM on top of double RAID1). Even though nothing changed after the oops (just a reboot with the same kernel, no fsck, etc.), no further oops has been caught. There were four freshly connected 1TB Seagate 7200.11 drives (with buggy firmware), but they were not used for file I/O; only a RAID1 sync ran on them before the oops, and after that they lay inactive. Three of them died shortly afterwards, but with no impact on the running system.
Comment 11 Gustavo Homem 2009-01-29 12:15:34 EST
(In reply to comment #9)
> (In reply to comment #8)
> Well, I meant more like machine vendor and type, storage type, etc. Processor
> information is too general. I would like to reproduce this issue, so I need
> something to go on.
> > 

This is the summary from the datacenter interface:

Qty     Component
3 	Generic \ 1024 MB \ DDR2 400 ECC Reg
1 	Generic \ 1024 MB \ DDR2 400 ECC
2 	Seagate \ 500GB:SATA:7200rpm \ ST3500630AS
2 	Intel \ 3.2 GHz 800FSB \ P4 Xeon
1 	Dell \ PE SC1425 \ 800FSB Dual Xeon
1 	Unknown \ Onboard \ SATA

The storage is local and the SATA driver is ata_piix.
Comment 12 Milan Kerslager 2009-01-30 02:50:47 EST
I have (dmidecode | grep 'Product Name'): ASUS KFN32-D SLI
Comment 13 Milan Kerslager 2009-01-30 02:52:07 EST
Created attachment 330446 [details]
lspci
Comment 14 Milan Kerslager 2009-01-30 02:52:47 EST
Created attachment 330447 [details]
lspci -nvvv
Comment 15 Milan Kerslager 2009-01-30 03:14:00 EST
Created attachment 330448 [details]
dmesg output after the start
Comment 16 Gustavo Homem 2009-04-03 15:13:50 EDT
Hello,

Is this being worked on? We're stuck on an old kernel and there are security updates to do.

Thanks
Comment 17 Milan Kerslager 2009-04-04 04:31:29 EDT
I am running the latest kernel and the bug has not appeared again.
Comment 18 Gustavo Homem 2009-04-04 12:24:26 EDT
Thanks for the comment Milan.

Note: 

There's nothing on the latest changelog about this:

http://rhn.redhat.com/errata/RHSA-2009-0326.html
Comment 19 Linda Wang 2009-04-07 15:36:42 EDT
RHSA-2009:0326 would be the latest 5.3.z kernel.

Per comment #17, closing as WORKSFORME. If you run into this issue
again, please feel free to reopen this bug. Thanks.
Comment 20 Gustavo Homem 2009-06-05 18:07:59 EDT
Unfortunately this bug is still present on the latest kernel: 2.6.18-128.1.10.el5.

I can reproduce it easily on one of my machines. If I fall back to 
2.6.18-53.1.13.el5 it runs stable for months.
Comment 21 Gustavo Homem 2009-06-05 18:08:48 EDT
Created attachment 346725 [details]
kernel logs from /var/log/messages
Comment 22 Gustavo Homem 2009-06-05 18:19:40 EDT
Is this problem related to this one?

https://bugzilla.redhat.com/show_bug.cgi?id=465845
Comment 23 Gustavo Homem 2009-08-27 11:16:15 EDT
After an fsck on the /home partition and an upgrade to the most recent kernel, the system seems to be stable. I have no idea why kernel 2.6.18-53.1.13.el5 would be stable without the fsck whereas all the others crashed.
