Bug 444591 - BUG: soft lockup - CPU#0 stuck for 110s! [ksoftirqd/0:4]
Status: CLOSED WONTFIX
Product: Fedora
Classification: Fedora
Component: kernel
Version: 8
Hardware: i686 Linux
Priority: low
Severity: high
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2008-04-29 08:31 EDT by Lev Shamardin
Modified: 2009-01-09 01:26 EST (History)
Last Closed: 2009-01-09 01:26:50 EST
Attachments:
list of loaded modules (3.11 KB, text/plain)
2008-05-12 15:55 EDT, Lev Shamardin
Description Lev Shamardin 2008-04-29 08:31:31 EDT
Description of problem:
The server hangs. I am unable to check whether it still responds to the keyboard.

Version-Release number of selected component (if applicable):
kernel-2.6.24.4-64.fc8.i686.rpm

How reproducible:
Reboot and wait for some time (up to several days).

Steps to Reproduce:
1. reboot
2. wait
Actual results:
The server hangs eventually, with log messages recorded.

Expected results:
Stable operation without hangs.

Additional info:
This is an excerpt from /var/log/messages after the crash:
Apr 25 12:59:35 vuz-002 openvpn[1697]: Connection reset, restarting [0]
Apr 25 12:59:35 vuz-002 openvpn[1697]: SIGUSR1[soft,connection-reset] received, 
process restarting
Apr 25 12:59:35 vuz-002 openvpn[1697]: WARNING: No server certificate verification method has been enabled.  See http://openvpn.net/howto.html#mitm for more info.
Apr 25 12:59:35 vuz-002 openvpn[1697]: Re-using SSL/TLS context
Apr 25 12:59:35 vuz-002 openvpn[1697]: LZO compression initialized
Apr 25 12:59:35 vuz-002 openvpn[1697]: RESOLVE: Cannot resolve host address: satan.cvg.tv: [NO_DATA] The requested name is valid but does not have an IP address.
Apr 25 13:01:22 vuz-002 openvpn[1697]:last message repeated 7 times
Apr 25 13:01:22 vuz-002 kernel: BUG: soft lockup - CPU#0 stuck for 110s! [ksoftirqd/0:4]
Apr 25 13:01:22 vuz-002 kernel: 
Apr 25 13:01:22 vuz-002 kernel: Pid: 4, comm: ksoftirqd/0 Tainted: P       
(2.6.24.4-64.fc8 #1)
Apr 25 13:01:22 vuz-002 kernel: EIP: 0060:[<c0435f59>] EFLAGS: 00000286 CPU: 0
Apr 25 13:01:22 vuz-002 kernel: EIP is at run_timer_softirq+0x115/0x196
Apr 25 13:01:22 vuz-002 kernel: EAX: c079bf00 EBX: 00000000 ECX: c077427d EDX:
c079bfd0
Apr 25 13:01:22 vuz-002 kernel: ESI: c0737ac0 EDI: c080df80 EBP: c05e39f7 ESP:
c079bfc0
Apr 25 13:01:22 vuz-002 kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Apr 25 13:01:22 vuz-002 kernel: CR0: 8005003b CR2: 080c1a10 CR3: 37187000 CR4:
000006d0
Apr 25 13:01:22 vuz-002 kernel: DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3:
00000000
Apr 25 13:01:22 vuz-002 kernel: DR6: ffff0ff0 DR7: 00000400
Apr 25 13:01:22 vuz-002 kernel:  [<c05e39f7>] peer_check_expire+0x0/0xd5
Apr 25 13:01:22 vuz-002 kernel:  [<c04335c2>] __do_softirq+0x66/0xd3
Apr 25 13:01:22 vuz-002 kernel:  [<c040759a>] do_softirq+0x6c/0xce
Apr 25 13:01:22 vuz-002 kernel:  [<c04334b8>] ksoftirqd+0x0/0xa4
Apr 25 13:01:22 vuz-002 kernel:  [<c04334fa>] ksoftirqd+0x42/0xa4
Apr 25 13:01:22 vuz-002 kernel:  [<c043edb4>] kthread+0x38/0x60
Apr 25 13:01:22 vuz-002 kernel:  [<c043ed7c>] kthread+0x0/0x60
Apr 25 13:01:22 vuz-002 kernel:  [<c0405e0b>] kernel_thread_helper+0x7/0x10
Apr 25 13:01:22 vuz-002 kernel:  =======================
Apr 25 13:01:22 vuz-002 kernel: ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0
action 0x2 frozen
Apr 25 13:01:22 vuz-002 kernel: ata1.00: cmd 60/08:00:9f:40:01/00:00:0f:00:00/40
tag 0 ncq 4096 in
Apr 25 13:01:22 vuz-002 kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr 25 13:01:22 vuz-002 kernel: ata1.00: status: { DRDY }
Apr 25 13:01:22 vuz-002 kernel: ata1: hard resetting link
Apr 25 13:01:23 vuz-002 kernel: ata1: SATA link up 1.5 Gbps (SStatus 113
SControl 300)
Apr 25 13:01:23 vuz-002 kernel: ata1.00: configured for UDMA/133
Apr 25 13:01:23 vuz-002 kernel: ata1: EH complete
Apr 25 13:01:23 vuz-002 kernel: sd 0:0:0:0: [sda] 312581808 512-byte hardware
sectors (160042 MB)
Apr 25 13:01:23 vuz-002 kernel: sd 0:0:0:0: [sda] Write Protect is off
Apr 25 13:01:23 vuz-002 kernel: sd 0:0:0:0: [sda] Write cache: enabled, read
cache: enabled, doesn't support DPO or FUA
Apr 25 13:01:40 vuz-002 shutdown[5873]: shutting down for system halt
Apr 25 13:01:51 vuz-002 smartd[1978]: smartd received signal 15: Terminated
Apr 25 13:01:51 vuz-002 smartd[1978]: smartd is exiting (exit status 0)
Apr 25 13:01:53 vuz-002 ntpd[1788]: ntpd exiting on signal 15
Apr 25 13:01:53 vuz-002 kernel: Kernel logging (proc) stopped.
Apr 25 13:01:53 vuz-002 kernel: Kernel log daemon terminating.
Apr 25 13:01:54 vuz-002 rsyslogd: [origin software="rsyslogd" swVersion="2.0.2"
x-pid="1746" x-info="http://www.rsyslog.com"] exiting on signal 15.
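
For triaging logs like the excerpt above, here is a minimal Python sketch that pulls soft lockup reports out of syslog lines. The regular expression is based only on the message format shown in this report, the function name `find_soft_lockups` is hypothetical, and the sketch assumes each log record sits on a single line (unlike the wrapped paste above).

```python
import re

# Matches e.g. "BUG: soft lockup - CPU#0 stuck for 110s! [ksoftirqd/0:4]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one (cpu, seconds, comm, pid) tuple per soft lockup report."""
    hits = []
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            hits.append(
                (int(m.group("cpu")), int(m.group("secs")),
                 m.group("comm"), int(m.group("pid")))
            )
    return hits


# Example input: one record copied from this report, joined onto one line.
sample = [
    "Apr 25 13:01:22 vuz-002 kernel: BUG: soft lockup - CPU#0 "
    "stuck for 110s! [ksoftirqd/0:4]",
]
for cpu, secs, comm, pid in find_soft_lockups(sample):
    print(f"CPU#{cpu} stuck for {secs}s in {comm} (pid {pid})")
```

To scan a real log, pass an open file object (for example `open("/var/log/messages")`) instead of the sample list.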
Comment 1 Chuck Ebbert 2008-05-01 02:28:01 EDT
kernel/timer.c::__run_timers()::663

                        spin_unlock_irq(&base->lock);

Does this server have a large number of connections to other machines?
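
One possible way to answer this question on the affected host is to count the entries in `/proc/net/tcp` by state. The Python sketch below is an illustration only: the function name `count_tcp_states` is hypothetical, only three state codes are given names, and it covers IPv4 TCP sockets rather than everything the kernel may be tracking.

```python
from collections import Counter

# A few TCP state codes as they appear (in hex) in the "st" column.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per state from the text of /proc/net/tcp."""
    counts = Counter()
    for row in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = row.split()
        if len(fields) > 3:
            # Column 3 is the state; unnamed codes are kept as raw hex.
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


# Usage on a live system (not executed here):
#   with open("/proc/net/tcp") as fh:
#       print(count_tcp_states(fh.read()))
```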



Comment 2 Dave Jones 2008-05-01 08:23:41 EDT
Also, which binary module is causing the taint?
Comment 3 Lev Shamardin 2008-05-04 01:13:39 EDT
No, this machine does not (at least should not) have a large number of
connections. Normally it should have only one openvpn session and several ssh
logins.

Unfortunately the machine is down again (even though I added
processor.max_cstate=1 to the kernel parameters last time). I'll post the list
of modules as soon as it is possible to reboot the host.
Comment 4 Lev Shamardin 2008-05-12 15:54:07 EDT
Okay, finally got the host back for the experiments.

So,
nvidia: module license 'NVIDIA' taints kernel.
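
The "Tainted: P" flag in the traces can also be read from `/proc/sys/kernel/tainted`, which holds a bitmask. The Python sketch below decodes the lowest six bits as defined for kernels of this era (later kernels define more bits). The function name `decode_taint` is hypothetical and the descriptions are paraphrased.

```python
# (bit, letter shown in the oops header, meaning)
TAINT_FLAGS = [
    (0, "P", "proprietary module loaded"),
    (1, "F", "module force-loaded"),
    (2, "S", "SMP with CPUs not designed for SMP"),
    (3, "R", "module force-unloaded"),
    (4, "M", "machine check exception occurred"),
    (5, "B", "bad page reference"),
]


def decode_taint(mask):
    """Return (letter, meaning) pairs for each known bit set in mask."""
    return [(letter, desc) for bit, letter, desc in TAINT_FLAGS
            if mask & (1 << bit)]


# Usage on a live system (not executed here):
#   with open("/proc/sys/kernel/tainted") as fh:
#       print(decode_taint(int(fh.read())))
print(decode_taint(1))  # a mask of 1 corresponds to the "P" flag
```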
Comment 5 Lev Shamardin 2008-05-12 15:55:30 EDT
Created attachment 305172
list of loaded modules
Comment 6 Lev Shamardin 2008-05-15 04:28:03 EDT
I observed the same behavior on another host. It has an identical hardware and
software configuration. This is the relevant part of the log:

May 14 22:59:56 vuz-003 kernel: Clocksource tsc unstable (delta = 789274161660 ns)
May 14 22:59:56 vuz-003 kernel: BUG: soft lockup - CPU#0 stuck for 13s!
[cvg-player:2178]
May 14 22:59:56 vuz-003 kernel: 
May 14 22:59:56 vuz-003 kernel: Pid: 2178, comm: cvg-player Tainted: P       
(2.6.24.4-64.fc8 #1)
May 14 22:59:56 vuz-003 kernel: EIP: 0073:[<008302a0>] EFLAGS: 00000202 CPU: 0
May 14 22:59:56 vuz-003 kernel: EIP is at 0x8302a0
May 14 22:59:56 vuz-003 kernel: EAX: 0a054aa8 EBX: 0091aff4 ECX: 00000430 EDX:
0091c1b0
May 14 22:59:56 vuz-003 kernel: ESI: 00002000 EDI: 0091c1b0 EBP: bfc8172c ESP:
bfc81684
May 14 22:59:56 vuz-003 kernel:  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
May 14 22:59:56 vuz-003 kernel: CR0: 80050033 CR2: b85776b8 CR3: 36dfd000 CR4:
000006d0
May 14 22:59:56 vuz-003 kernel: DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3:
00000000
May 14 22:59:56 vuz-003 kernel: DR6: ffff0ff0 DR7: 00000400
May 14 22:59:56 vuz-003 kernel:  =======================

cvg-player is a user-space application written in Python, running as an
unprivileged user.

I also have several machines which differ from the two above in only one
respect: they do not have the TV tuner board installed (Philips Semiconductors
SAA7133/SAA7135, PCI id 1131:7133 (rev d1)). Those machines have never had any
problems.
Comment 7 Lev Shamardin 2008-06-02 06:34:41 EDT
A new hang:

May 31 07:52:27 vuz-003 kernel: BUG: soft lockup - CPU#0 stuck for 60s!
[ksoftirqd/0:4]
May 31 07:52:27 vuz-003 kernel: 
May 31 07:52:27 vuz-003 kernel: Pid: 4, comm: ksoftirqd/0 Tainted: P       
(2.6.24.4-64.fc8 #1)
May 31 07:52:27 vuz-003 kernel: EIP: 0060:[<c0435f59>] EFLAGS: 00000286 CPU: 0
May 31 07:52:27 vuz-003 kernel: EIP is at run_timer_softirq+0x115/0x196
May 31 07:52:27 vuz-003 kernel: EAX: c079bf00 EBX: f718c0f0 ECX: f8a909e2 EDX:
c079bfd0
May 31 07:52:27 vuz-003 kernel: ESI: f718c15c EDI: c080df80 EBP: f8a90556 ESP:
c079bfc0
May 31 07:52:27 vuz-003 kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
May 31 07:52:27 vuz-003 kernel: CR0: 8005003b CR2: 0844770d CR3: 3714f000 CR4:
000006d0
May 31 07:52:27 vuz-003 kernel: DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3:
00000000
May 31 07:52:27 vuz-003 kernel: DR6: ffff0ff0 DR7: 00000400
May 31 07:52:27 vuz-003 kernel:  [<f8a90556>] death_by_timeout+0x0/0xae
[nf_conntrack]
May 31 07:52:27 vuz-003 kernel:  [<c04335c2>] __do_softirq+0x66/0xd3
May 31 07:52:27 vuz-003 kernel:  [<c040759a>] do_softirq+0x6c/0xce
May 31 07:52:27 vuz-003 kernel:  [<c04334b8>] ksoftirqd+0x0/0xa4
May 31 07:52:27 vuz-003 kernel:  [<c04334fa>] ksoftirqd+0x42/0xa4
May 31 07:52:27 vuz-003 kernel:  [<c043edb4>] kthread+0x38/0x60
May 31 07:52:27 vuz-003 kernel:  [<c043ed7c>] kthread+0x0/0x60
May 31 07:52:27 vuz-003 kernel:  [<c0405e0b>] kernel_thread_helper+0x7/0x10
May 31 07:52:27 vuz-003 kernel:  =======================

This time it happened without the TV tuner kernel module loaded.
Any ideas on how to fix this? Is there any chance that this bug is absent from
earlier kernels?
Comment 8 Bug Zapper 2008-11-26 05:36:05 EST
This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 8 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 9 Bug Zapper 2009-01-09 01:26:50 EST
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
