Bug 243083

Summary: soft lockup detected on CPU#0 and CPU#1
Product: [Fedora] Fedora
Reporter: Bernard Fouché <bernard.fouche>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: low
Version: 7
CC: matteo, matt
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2007-08-29 18:38:46 UTC
Attachments:
dmesg trace (flags: none)

Description Bernard Fouché 2007-06-07 08:58:06 UTC
Description of problem:

The following traces appear in /var/log/messages with the F7 kernel (2.6.21-1.3194)
on a Dell Precision 370 (P4 3 GHz, step1):

Trace 1:
Jun  6 14:21:46 linuxbf kernel: BUG: soft lockup detected on CPU#0!
Jun  6 14:21:46 linuxbf kernel:  [<c0451f3e>] softlockup_tick+0xa5/0xb4
Jun  6 14:21:46 linuxbf kernel:  [<c042e930>] update_process_times+0x3b/0x5e
Jun  6 14:21:46 linuxbf kernel:  [<c043d2bd>] tick_sched_timer+0x78/0xbb
Jun  6 14:21:46 linuxbf kernel:  [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6
Jun  6 14:21:46 linuxbf kernel:  [<c043d245>] tick_sched_timer+0x0/0xbb
Jun  6 14:21:46 linuxbf kernel:  [<c05b8578>] rt_check_expire+0x0/0x158
Jun  6 14:21:46 linuxbf kernel:  [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80
Jun  6 14:21:46 linuxbf kernel:  [<c04059bc>] apic_timer_interrupt+0x28/0x30
Jun  6 14:21:46 linuxbf kernel:  [<c05b8578>] rt_check_expire+0x0/0x158
Jun  6 14:21:46 linuxbf kernel:  [<c042007b>] find_busiest_group+0x207/0x4c5
Jun  6 14:21:46 linuxbf kernel:  [<c042dcee>] run_timer_softirq+0x10a/0x17b
Jun  6 14:21:46 linuxbf kernel:  [<c05b8578>] rt_check_expire+0x0/0x158
Jun  6 14:21:46 linuxbf kernel:  [<c042b2e5>] __do_softirq+0x5d/0xba
Jun  6 14:21:46 linuxbf kernel:  [<c04071b7>] do_softirq+0x59/0xb1
Jun  6 14:21:46 linuxbf kernel:  [<c042b1c7>] ksoftirqd+0x0/0xc1
Jun  6 14:21:46 linuxbf kernel:  [<c042b226>] ksoftirqd+0x5f/0xc1
Jun  6 14:21:46 linuxbf kernel:  [<c0436da8>] kthread+0xb0/0xd8
Jun  6 14:21:46 linuxbf kernel:  [<c0436cf8>] kthread+0x0/0xd8
Jun  6 14:21:46 linuxbf kernel:  [<c0405b3f>] kernel_thread_helper+0x7/0x10
Jun  6 14:21:46 linuxbf kernel:  =======================
Jun  6 14:49:50 linuxbf kernel: BUG: soft lockup detected on CPU#1!
Jun  6 14:49:50 linuxbf kernel:  [<c0451f3e>] softlockup_tick+0xa5/0xb4
Jun  6 14:49:50 linuxbf kernel:  [<c042e930>] update_process_times+0x3b/0x5e
Jun  6 14:49:50 linuxbf kernel:  [<c043d2bd>] tick_sched_timer+0x78/0xbb
Jun  6 14:49:50 linuxbf kernel:  [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6
Jun  6 14:49:50 linuxbf kernel:  [<c043d245>] tick_sched_timer+0x0/0xbb
Jun  6 14:49:50 linuxbf kernel:  [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80
Jun  6 14:49:50 linuxbf kernel:  [<c04059bc>] apic_timer_interrupt+0x28/0x30
Jun  6 14:49:50 linuxbf kernel:  =======================


Trace 2:
Jun  6 21:44:32 linuxbf kernel: Clocksource tsc unstable (delta = 501984757941 ns)
Jun  6 21:44:32 linuxbf ntpd[1985]: synchronized to LOCAL(0), stratum 10
Jun  6 21:44:32 linuxbf kernel: Time: hpet clocksource has been installed.
Jun  6 21:44:32 linuxbf kernel: BUG: soft lockup detected on CPU#0!
Jun  6 21:44:32 linuxbf kernel:  [<c0451f3e>] softlockup_tick+0xa5/0xb4
Jun  6 21:44:32 linuxbf kernel:  [<c042e930>] update_process_times+0x3b/0x5e
Jun  6 21:44:32 linuxbf kernel:  [<c043d2bd>] tick_sched_timer+0x78/0xbb
Jun  6 21:44:32 linuxbf kernel:  [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6
Jun  6 21:44:32 linuxbf kernel:  [<c043d245>] tick_sched_timer+0x0/0xbb
Jun  6 21:44:32 linuxbf kernel:  [<c05c3634>] inet_twdr_hangman+0x0/0x94
Jun  6 21:44:32 linuxbf kernel:  [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80
Jun  6 21:44:32 linuxbf kernel:  [<c042e863>] __mod_timer+0xa1/0xab
Jun  6 21:44:32 linuxbf kernel:  [<c04059bc>] apic_timer_interrupt+0x28/0x30
Jun  6 21:44:32 linuxbf kernel:  [<c05c3634>] inet_twdr_hangman+0x0/0x94
Jun  6 21:44:32 linuxbf kernel:  [<c042007b>] find_busiest_group+0x207/0x4c5
Jun  6 21:44:32 linuxbf kernel:  [<c042dcee>] run_timer_softirq+0x10a/0x17b
Jun  6 21:44:32 linuxbf kernel:  [<c05c3634>] inet_twdr_hangman+0x0/0x94
Jun  6 21:44:32 linuxbf kernel:  [<c042a588>] it_real_fn+0x12/0x16
Jun  6 21:44:32 linuxbf kernel:  [<c042b2e5>] __do_softirq+0x5d/0xba
Jun  6 21:44:32 linuxbf kernel:  [<c04071b7>] do_softirq+0x59/0xb1
Jun  6 21:44:32 linuxbf kernel:  [<c042b1c7>] ksoftirqd+0x0/0xc1
Jun  6 21:44:32 linuxbf kernel:  [<c042b226>] ksoftirqd+0x5f/0xc1
Jun  6 21:44:32 linuxbf kernel:  [<c0436da8>] kthread+0xb0/0xd8
Jun  6 21:44:32 linuxbf kernel:  [<c0436cf8>] kthread+0x0/0xd8
Jun  6 21:44:32 linuxbf kernel:  [<c0405b3f>] kernel_thread_helper+0x7/0x10
Jun  6 21:44:32 linuxbf kernel:  =======================
Jun  6 21:44:32 linuxbf kernel: sd 0:0:0:0: SCSI error: return code = 0x06000000
Jun  6 21:44:32 linuxbf kernel: end_request: I/O error, dev sda, sector 32452941
Jun  6 21:44:32 linuxbf kernel: EXT3-fs error (device dm-0): read_block_bitmap: Cannot read block bitmap - block_group = 123, block_bitmap = 4030464
Jun  6 21:44:32 linuxbf kernel: BUG: soft lockup detected on CPU#1!
Jun  6 21:44:32 linuxbf kernel:  [<c0451f3e>] softlockup_tick+0xa5/0xb4
Jun  6 21:44:32 linuxbf kernel:  [<c042e930>] update_process_times+0x3b/0x5e
Jun  6 21:44:32 linuxbf kernel:  [<c043d2bd>] tick_sched_timer+0x78/0xbb
Jun  6 21:44:32 linuxbf kernel:  [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6
Jun  6 21:44:32 linuxbf kernel:  [<c043d245>] tick_sched_timer+0x0/0xbb
Jun  6 21:44:32 linuxbf kernel:  [<c0419c40>] smp_apic_timer_interrupt+0x6f/0x80
Jun  6 21:44:32 linuxbf kernel:  [<c04059bc>] apic_timer_interrupt+0x28/0x30
Jun  6 21:44:32 linuxbf kernel:  [<c043007b>] do_notify_parent+0xf1/0x154
Jun  6 21:44:32 linuxbf kernel:  [<c0403281>] mwait_idle_with_hints+0x3b/0x3f
Jun  6 21:44:32 linuxbf kernel:  [<c04033d6>] cpu_idle+0xa3/0xc4
Jun  6 21:44:32 linuxbf kernel:  =======================
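
For anyone triaging a long /var/log/messages, here is a minimal sketch (not part of the original report) of how the reports above could be summarized. The function names are hypothetical, and the regular expressions assume exactly the message formats shown in these traces.

```python
import re

# Header line printed once per soft lockup report (format as in the traces above).
LOCKUP_RE = re.compile(r"BUG: soft lockup detected on CPU#(\d+)!")
# One stack frame: [<address>] symbol+0xoffset/0xsize
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\S+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# Clocksource watchdog message carrying a delta in nanoseconds.
DELTA_RE = re.compile(r"Clocksource tsc unstable \(delta = (-?\d+) ns\)")


def parse_lockups(lines):
    """Return one (cpu, [symbol, ...]) tuple per soft lockup report."""
    reports = []
    frames = None  # frame list of the report being read, if any
    for line in lines:
        header = LOCKUP_RE.search(line)
        if header:
            frames = []
            reports.append((int(header.group(1)), frames))
            continue
        frame = FRAME_RE.search(line)
        if frame and frames is not None:
            frames.append(frame.group(1))
        elif "=======" in line:
            frames = None  # separator line ends the current report
    return reports


def tsc_delta_seconds(line):
    """Return the reported TSC delta in seconds, or None if the line has none."""
    match = DELTA_RE.search(line)
    return int(match.group(1)) / 1e9 if match else None
```

Applied to the first line of Trace 2, tsc_delta_seconds converts the reported delta of 501984757941 ns to roughly 502 seconds.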

Version-Release number of selected component (if applicable):


How reproducible:

The computer ran the latest FC5 kernel without any problems until it was upgraded
to F7. Then this bug appeared twice, as shown above. The system is also affected by
bug #240982. The system is now unstable: it has hung a few times or slowed to a crawl.
A fix is badly needed.

Comment 1 Chuck Ebbert 2007-06-07 16:16:10 UTC
Please try the kernel parameter
    clocksource=acpi_pm


Comment 2 Bernard Fouché 2007-06-07 17:29:24 UTC
What I did:

- After my bug report, I stayed in non-hyperthreading mode (set in the BIOS). I had
the previously reported kernel traces in /var/log/messages but did not experience
any problem with the computer for a full day of work (many cross-compilations,
find(1) on /, etc.): no crawling, no crash.

- Read your request, set 'clocksource=acpi_pm' in /etc/grub.conf, and used the BIOS to
re-enable hyperthreading. The system booted fine. I can see in /var/log/messages:

Jun  7 19:18:21 linuxbf kernel: Time: acpi_pm clocksource has been installed.

(times are local time in France)
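
Besides the log line above, the running kernel's command line can confirm that the parameter was picked up. A minimal sketch, assuming /proc/cmdline is readable; the function names and the sample command line in the usage note are hypothetical, not taken from this machine:

```python
def clocksource_from_cmdline(cmdline):
    """Return the value of clocksource= on a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "clocksource" and sep:
            return value
    return None


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()
```

For example, clocksource_from_cmdline("ro quiet clocksource=acpi_pm") returns "acpi_pm", and a command line without the parameter returns None.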

Now the system works correctly, but I must go back home. I'll leave the computer
running and report any further problems (or the lack of them!). (FYI, bug #240982 is still present.)

Thanks.

Comment 3 Bernard Fouché 2007-06-08 07:32:22 UTC
This morning the computer was still running. However it was much less responsive
than yesterday when I rebooted it. I started to fill the bugzilla form while the
computer was crawling more and more until it froze. The only solution was to
switch it off/on.

I went back to single core operation (through the BIOS) and now it works correctly.

Adding 'clocksource=acpi_pm' got rid of the 'soft lockup' messages in
/var/log/messages.

I'll report later, but I think that the freezing problem is linked to bug
#240982 and not to the present one, which has vanished with the 'clocksource' parameter.

Comment 4 Bernard Fouché 2007-06-08 15:13:10 UTC
With 'clocksource=acpi_pm' and hyperthreading disabled for many hours now, I
still see no "soft lockup" messages and no crawling.

Comment 5 Bernard Fouché 2007-06-11 08:13:20 UTC
The computer ran for a weekend without any problem. IMHO 'clocksource=acpi_pm'
fixed the problem.

Comment 6 Matt Darcy 2007-06-11 08:47:32 UTC
Are you still running without hyperthreading enabled?
Could someone explain this parameter in basic detail, please? I can't find info
on it.


Comment 7 Matt Darcy 2007-06-11 10:59:16 UTC
When I say "this parameter" I mean the 'clocksource=acpi_pm' option.
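
For reference: the clocksource= boot parameter selects which hardware timer the kernel uses for timekeeping. The logs in this report show three of them: tsc (the CPU time stamp counter), hpet, and acpi_pm (the ACPI power management timer). On kernels that expose the clocksource sysfs directory, the current and available sources can be read as sketched below. The function name is hypothetical, and the sysfs files may be absent on older kernels.

```python
from pathlib import Path

# Standard sysfs location of the clocksource controls on Linux.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=SYSFS_DIR):
    """Return (current, [available, ...]) as reported by the kernel."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available
```

Calling read_clocksources() with no argument reads the live system. Passing another directory is only useful for testing.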

Comment 8 Bernard Fouché 2007-06-11 15:32:47 UTC
Yes, I'm still in single core mode, set through the BIOS. I'm waiting for a fix for bug
#240982 before trying dual core mode again: I can't afford freezes these days.


Comment 9 Bernard Fouché 2007-06-14 08:30:41 UTC
Went back to hyperthreading, set through the BIOS. Dropped the parameter
'clocksource=acpi_pm'. Updated the kernel to 2.6.21-1.3228. No more 'soft lockup' in
/var/log/messages 40 minutes after reboot. Will report later if this error
message comes back.

Comment 10 Bernard Fouché 2007-06-14 09:39:12 UTC
The computer froze after one hour. No particular output in /var/log/messages. I was
unable to ssh into the computer, and lost the mouse pointer when I hit ctrl-alt-f1 while
trying to get a text terminal. Went back to the original F7 kernel, no dual core,
clocksource=acpi_pm. Will retry later when I can afford to lose some more time...

Comment 11 Matteo Corti 2007-06-14 13:06:50 UTC
I am experiencing the same problem, but only if I run my folding@home client. I
get the messages in /var/log/messages and I cannot start new processes (but all
the running ones are fine). If I kill the folding@home client everything is fine.

Comment 12 Matteo Corti 2007-06-14 13:10:15 UTC
Created attachment 156990 [details]
dmesg trace

This is a snippet of the kernel messages. I may try changing the
clocksource when I reboot the machine.

Comment 13 Bernard Fouché 2007-06-14 13:50:58 UTC
I'm also running folding@home. Thanks for pointing that out to me. If there is
a low-level problem with threads or sockets, then F@H may be triggering a yet
unknown bug! I'll try later to reboot with 3228 and no F@H.

Comment 14 Matteo Corti 2007-06-14 13:54:12 UTC
I can confirm the same behavior with 3228.

Comment 15 Bernard Fouché 2007-06-14 17:11:16 UTC
Went back to hyperthreading with kernel 3228, no clocksource parameter. I'm leaving
the office now and won't be back until Tuesday. I'm leaving the computer idling
without folding@home.

Comment 16 Bernard Fouché 2007-06-20 16:29:05 UTC
I have not needed to reboot since Friday with kernel 3228 (the computer has been running for 6 days).
No more soft lockup messages. This kernel is fine for me, but I did not try it
with F@H. Time to close this bug report?