Today I hit several BUG reports and then a hard lockup with this error message:

Jan 10 04:16:33 opteron kernel: BUG: soft lockup - CPU#0 stuck for 10s! [repquota:6377]
Jan 10 04:16:33 opteron kernel: CPU 0:
Jan 10 04:16:33 opteron kernel: Modules linked in: nfs lockd fscache nfs_acl sunrpc netloop netbk blktap blkbk ipt_MASQUERADE bridge hidp l2cap bluetooth ip_conntrack_netbios_ns ipt_REJECT ipt_LOG xt_limit ipt_recent xt_state xt_tcpudp iptable_filter iptable_nat ip_nat ip_conntrack nfnetlink iptable_mangle ip_tables x_tables dm_multipath raid456 xor video sbs i2c_ec backlight button battery asus_acpi ac lp 8250_pnp snd_hda_intel snd_hda_codec snd_via82xx gameport snd_ac97_codec ac97_bus snd_mpu401_uart snd_rawmidi snd_seq_dummy shpchp snd_seq_oss snd_seq_midi_event k8_edac 8250 snd_seq serial_core snd_seq_device snd_pcm_oss edac_mc snd_mixer_oss k8temp parport_pc parport snd_pcm i2c_nforce2 hwmon snd_timer i2c_core serio_raw tg3 snd soundcore snd_page_alloc sg pcspkr dm_snapshot dm_zero dm_mirror dm_mod sata_nv libata sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Jan 10 04:16:33 opteron kernel: Pid: 6377, comm: repquota Not tainted 2.6.18-92.1.22.el5xen #1
Jan 10 04:16:33 opteron kernel: RIP: e030:[<ffffffff8025383b>] [<ffffffff8025383b>] vfs_quota_sync+0x8b/0x15a
Jan 10 04:16:33 opteron kernel: RSP: e02b:ffff8803032d1e38 EFLAGS: 00000202
Jan 10 04:16:33 opteron kernel: RAX: 0000000000000031 RBX: ffff8803e7bba000 RCX: ffff88035b3eb6b0
Jan 10 04:16:33 opteron kernel: RDX: ffffffff80264868 RSI: 0000000000000002 RDI: ffffffff8050b864
Jan 10 04:16:33 opteron kernel: RBP: ffff88035b3eb680 R08: ffff8803032d1e58 R09: 0000000000000000
Jan 10 04:16:33 opteron kernel: R10: ffff880327a06480 R11: 0000000000800001 R12: ffff8803e7bba188
Jan 10 04:16:33 opteron kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffff8803e7bba120
Jan 10 04:16:33 opteron kernel: FS: 00002aaf90ef7250(0000) GS:ffffffff805b0000(0000) knlGS:0000000000000000
Jan 10 04:16:33 opteron kernel: CS: e033 DS: 0000 ES: 0000
Jan 10 04:16:33 opteron kernel:
Jan 10 04:16:33 opteron kernel: Call Trace:
Jan 10 04:16:33 opteron kernel: [<ffffffff80253869>] vfs_quota_sync+0xb9/0x15a
Jan 10 04:16:33 opteron kernel: [<ffffffff802eec78>] quota_sync_sb+0x17/0xf0
Jan 10 04:16:33 opteron kernel: [<ffffffff802ef330>] sys_quotactl+0x4c8/0x5fc
Jan 10 04:16:33 opteron kernel: [<ffffffff802ae94a>] audit_syscall_entry+0x16e/0x1a1
Jan 10 04:16:33 opteron kernel: [<ffffffff802602f9>] tracesys+0xab/0xb6
Jan 10 04:16:33 opteron kernel:

I have the 2.6.18-92.1.22.el5xen kernel (the process runs in a Xen environment).
Created attachment 328654 [details] messages from the kernel with BUG outputs
This problem started to happen after upgrading from 2.6.18-53.1.13.el5 to 2.6.18-92.1.22.el5 and totally hangs the system. Here are some more logs:

EXT3-fs error (device sda5): ext3_lookup: unlinked inode 29427759 in dir #55378055
Aborting journal on device sda5.
ext3_abort called.
EXT3-fs error (device sda5): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
EXT3-fs error (device sda5): ext3_lookup: unlinked inode 29427759 in dir #55378055
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
journal commit I/O error
journal commit I/O error
EXT3-fs error (device sda5): ext3_lookup: unlinked inode 29427759 in dir #55378055
journal commit I/O error
EXT3-fs error (device sda5): ext3_lookup: unlinked inode 29427759 in dir #55378055
BUG: soft lockup - CPU#0 stuck for 10s! [repquota:3806]

Jan 11 01:45:01 multibasevat503024368 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [repquota:8270]
Jan 11 01:45:01 multibasevat503024368 kernel:
Jan 11 01:45:01 multibasevat503024368 kernel: Pid: 8270, comm: repquota
Jan 11 01:45:01 multibasevat503024368 kernel: EIP: 0060:[<c0609ba4>] CPU: 3
Jan 11 01:45:01 multibasevat503024368 kernel: EIP is at _spin_lock+0x3/0xf
Jan 11 01:45:01 multibasevat503024368 kernel: EFLAGS: 00000246 Not tainted (2.6.18-92.1.22.el5 #1)
Jan 11 01:45:01 multibasevat503024368 kernel: EAX: c068039c EBX: f40ca580 ECX: f798c200 EDX: 00000002
Jan 11 01:45:01 multibasevat503024368 kernel: ESI: f40ca580 EDI: f798c2e4 EBP: 00000000 DS: 007b ES: 007b
Jan 11 01:45:01 multibasevat503024368 kernel: CR0: 8005003b CR2: 0804fda0 CR3: 32918000 CR4: 000006d0
Jan 11 01:45:01 multibasevat503024368 kernel: [<c04995bd>] dqput+0x53/0x15d
Jan 11 01:45:01 multibasevat503024368 kernel: [<c049aaca>] vfs_quota_sync+0x9b/0x131
Jan 11 01:45:01 multibasevat503024368 kernel: [<c049c851>] quota_sync_sb+0x11/0xcc
Jan 11 01:45:01 multibasevat503024368 kernel: [<c0438859>] down_read+0x8/0x11
Jan 11 01:45:01 multibasevat503024368 kernel: [<c049cec0>] sys_quotactl+0x4c7/0x5f3
Jan 11 01:45:01 multibasevat503024368 kernel: [<c060acb3>] do_page_fault+0x20a/0x4b8
Jan 11 01:45:01 multibasevat503024368 kernel: [<c060ad1d>] do_page_fault+0x274/0x4b8
Jan 11 01:45:01 multibasevat503024368 kernel: [<c0404eff>] syscall_call+0x7/0xb
Jan 11 01:45:01 multibasevat503024368 kernel: =======================
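Both reports show the same message shape ("BUG: soft lockup - CPU#N stuck for Ns! [command:pid]"), always with repquota as the stuck process. For anyone who wants to check their own /var/log/messages for the same pattern, here is a minimal Python sketch. The regular expression is derived only from the lines quoted in this report, and the helper names (find_soft_lockups, count_by_command) are made up for illustration; other kernel versions may word the message differently.

```python
import re
from collections import Counter

# Pattern taken from the log lines quoted in this report, e.g.
# "BUG: soft lockup - CPU#0 stuck for 10s! [repquota:6377]".
# Assumption: other kernels may format this message differently.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Yield (cpu, seconds, command, pid) for each soft-lockup line."""
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            yield (
                int(match["cpu"]),
                int(match["secs"]),
                match["comm"],
                int(match["pid"]),
            )


def count_by_command(lines):
    """Count soft-lockup events per stuck command name."""
    return Counter(comm for _, _, comm, _ in find_soft_lockups(lines))


if __name__ == "__main__":
    # Sample lines copied from the two reports above.
    sample = [
        "Jan 10 04:16:33 opteron kernel: BUG: soft lockup - CPU#0 stuck for 10s! [repquota:6377]",
        "Jan 11 01:45:01 multibasevat503024368 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [repquota:8270]",
        "Jan 11 01:45:01 multibasevat503024368 kernel: EIP is at _spin_lock+0x3/0xf",
    ]
    for event in find_soft_lockups(sample):
        print(event)
    print(count_by_command(sample))
```

To scan a real log, pass an open file object instead of the sample list, for example find_soft_lockups(open("/var/log/messages")).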
Milan, Gustavo, do you experience this issue on xen host only or on bare hardware too? Does the problem appear with 2.6.18-92.1.18 kernel? Thanks.
Hello Jiri,

(In reply to comment #3)
> Milan, Gustavo, do you experience this issue on xen host only or on bare
> hardware too?

I don't know about Xen. For us it happens on bare hardware.

> Does the problem appear with 2.6.18-92.1.18 kernel? Thanks.

We did not test on other kernels. My test results:

2.6.18-53.1.13.el5 -> rock solid for more than one year
2.6.18-92.1.22.el5 -> crashes after some minutes of uptime
I have bare hardware with kernel-xen. No virtual machine was running at that time (I use the host systems as builder hosts only from time to time). This was the only oops I have seen in the past 8 months (since I got this machine). This is the main server for our school (700 users total, 150 online, Samba, roaming profiles, Web, Webmail, 2 CPUs, 4 cores total). I have had no further oops since rebooting with the same kernel.
On my side the problem is deterministic. With 2.6.18-92.1.22.el5 it crashes with the reported error messages after some minutes.
Ok, could you please provide the type of machine you are experiencing this issue on? It would also be helpful if you could try to trigger the issue with the 2.6.18-92.1.18 kernel. Thanks a lot.
(In reply to comment #7)
> Ok, could you please provide the type of machine you are experiencing this
> issue on?

Sure:

processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 10
cpu MHz : 3200.495
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips : 6403.75

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 10
cpu MHz : 3200.495
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips : 6400.26

processor : 2
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 10
cpu MHz : 3200.495
cache size : 2048 KB
physical id : 3
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips : 6400.27

processor : 3
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 10
cpu MHz : 3200.495
cache size : 2048 KB
physical id : 3
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips : 6400.31

             total       used       free     shared    buffers     cached
Mem:       3631836    3224712     407124          0     284308    2519580
-/+ buffers/cache:     420824    3211012
Swap:      2096472        424    2096048

The machine does a lot of I/O and has quotas activated for all users.

> It would be also helpful if you can try to trigger the issue with
> 2.6.18-92.1.18 kernel. Thanks a lot.

I really can't do it. It is a heavily used production machine on a remote data center.
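A note on reading the memory figures in this comment: in the older procps `free` layout shown here, the "-/+ buffers/cache" row is derived from the "Mem:" row by treating buffers and page cache as reclaimable. The small Python sketch below redoes that arithmetic with the numbers from the output; the function name buffers_cache_row is made up for illustration, and the unit is assumed to be kilobytes (the default for `free`).

```python
# Figures copied from the "Mem:" row of the free output in this comment.
# Assumption: values are in kilobytes, the default unit of free.
total, used, free = 3631836, 3224712, 407124
buffers, cached = 284308, 2519580


def buffers_cache_row(used, free, buffers, cached):
    """Return (used, free) with buffers and page cache counted as free."""
    reclaimable = buffers + cached
    return used - reclaimable, free + reclaimable


app_used, app_free = buffers_cache_row(used, free, buffers, cached)
print(app_used, app_free)  # 420824 3211012, matching the "-/+ buffers/cache" row
print(used + free == total)  # True
```

So most of the "used" memory on this machine is buffers and page cache, which fits the remark that it does a lot of I/O.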
(In reply to comment #8)
Well, I meant more like machine vendor and type, storage type, etc. The processor information is too general. I would like to reproduce this issue, so I need something to go on.

>
> The machine does a lot of I/O and has quotas activated for all users.
>
> > It would be also helpful if you can try to trigger the issue with
> > 2.6.18-92.1.18 kernel. Thanks a lot.
>
> I really can't do it. It is a heavily used production machine on a remote data
> center.

Ok - I understand.
I have quota active for /home too (LVM on top of double RAID1). Even though nothing changed after the oops (just a reboot with the same kernel, no fsck, etc.), no further oops has been caught. There were four freshly connected 1TB Seagate 7200.11 drives (with buggy firmware), but they had not been used for file I/O; only the RAID1 had been synced before the oops, and after that they lay inactive. Three of them died shortly afterwards, but with no impact on the running system.
(In reply to comment #9)
> (In reply to comment #8)
> Well I meant more like machine vendor and type, storage type etc. Processor
> information are too general. I would like to reproduce this issue so I need to
> catch on something.
>
>

This is the summary from the datacenter interface:

Qty  Component
3    Generic \ 1024 MB \ DDR2 400 ECC Reg
1    Generic \ 1024 MB \ DDR2 400 ECC
2    Seagate \ 500GB:SATA:7200rpm \ ST3500630AS
2    Intel \ 3.2 GHz 800FSB \ P4 Xeon
1    Dell \ PE SC1425 \ 800FSB Dual Xeon
1    Unknown \ Onboard \ SATA

The storage is local and the SATA driver is ata_piix.
I have (dmidecode | grep 'Product Name'): ASUS KFN32-D SLI
Created attachment 330446 [details] lspci
Created attachment 330447 [details] lspci -nvvv
Created attachment 330448 [details] dmesg output after the start
Hello, Is this being worked on? We're stuck on an old kernel and there are security updates to do. Thanks
I am running the latest kernel and the bug has not appeared again.
Thanks for the comment Milan. Note: There's nothing on the latest changelog about this: http://rhn.redhat.com/errata/RHSA-2009-0326.html
RHSA-2009:0326 would be the latest 5.3.z kernel. Per comment #17, closing as WORKSFORME. If you run into this issue again, please feel free to reopen this bug. Thanks.
Unfortunately this bug is still present on the latest kernel: 2.6.18-128.1.10.el5. I can reproduce it easily on one of my machines. If I fall back to 2.6.18-53.1.13.el5 it runs stable for months.
Created attachment 346725 [details] kernel logs from /var/log/messages
Is this problem related to this one? https://bugzilla.redhat.com/show_bug.cgi?id=465845
After an fsck on the /home partition and an upgrade to the most recent kernel, the system seems to be stable. No idea why kernel 2.6.18-53.1.13.el5 would be stable without the fsck whereas all the others crashed.