Bug 205147 - x86_64 smp issues revisited
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 5
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-09-04 20:44 UTC by wolfgang pichler
Modified: 2007-11-30 22:11 UTC

Fixed In Version: FC6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-07-02 13:27:31 UTC
Type: ---
Embargoed:


Attachments
var/log/messages (409.22 KB, application/octet-stream)
2006-09-04 20:44 UTC, wolfgang pichler
dmesg until problem losing ticks (59.21 KB, application/octet-stream)
2006-09-04 20:47 UTC, wolfgang pichler
var/log/messages (34.73 KB, application/octet-stream)
2006-09-04 20:48 UTC, wolfgang pichler

Description wolfgang pichler 2006-09-04 20:44:31 UTC
INTRO :

i have been fiddling with amd anomalies for quite some years.

lkml, other lists and the internet in general are full of complaints and
workarounds, but no definite solutions.

the most severe problem arises when disk-io interrupts get lost and the system
freezes with

ataN: command 0xca timeout, stat 0x50 host_stat 0x4
ataN: status=0x50 { DriveReady SeekComplete }

the clearest answer given to the problem is http://lkml.org/lkml/2006/2/1/396

the next problem is that lost ticks are reported very frequently, which does
not sound so severe, especially if ntpd keeps the time skew within limits.

individually one might encounter sluggish keyboard and mouse behaviour, which
can get so bad that no meaningful typing is possible, either in X or in
windoze vmware-guests, which also might report "delayed write failed" errors
(esp. xp-sp2).

---

in fc4 i worked around it for some time by running i386 with NO smp-kernel, but
was really not amused to have bought an expensive cpu only to run it effectively
as a 2200+

so i did extensive research and switched to the i386 smp-kernel w/
"report_lost_ticks=100 nmi_watchdog=2"

everything was acceptable except for the "windows delayed write" errors.
experiments with tso and/or other nics did not change anything.

---

when switching to fc5 i decided to give x86_64 a try.

interestingly "report_lost_ticks=100 nmi_watchdog=2" 
was NOT sufficient to make the system NOT freeze after 5 to 90 minutes uptime.

as the next step irqbalance had to be dropped, which gives a strong hint that
irq-routing (and apic-code ?) is involved, in addition to possibly faulty
device drivers.

but alas, "delayed write failed" errors on vmware xp2-guests increased by an
order of magnitude, rendering them unusable.
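
One way to look at irq-routing on an smp box is the per-CPU counters in /proc/interrupts. The following is a minimal sketch, assuming the usual layout (a header row of CPU names, then one row per interrupt source); the function names and the sample text are made up for illustration and are not taken from the reporter's machine.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into (cpu_names, {irq: (counts, description)})."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cpus = lines[0].split()              # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:len(cpus)]:
            if not field.isdigit():      # rows such as ERR: carry fewer counters
                break
            counts.append(int(field))
        table[label.strip()] = (counts, " ".join(fields[len(counts):]))
    return cpus, table


def single_cpu_irqs(table):
    """Return IRQ labels whose non-zero counts all landed on exactly one CPU."""
    return [irq for irq, (counts, _desc) in table.items()
            if len(counts) > 1 and sum(1 for c in counts if c) == 1]


# Made-up sample in the /proc/interrupts layout, NOT data from this bug report.
SAMPLE = """\
           CPU0       CPU1
  0:     500000          0    IO-APIC-edge  timer
 16:      12000      11000   IO-APIC-level  libata
NMI:        100         90
ERR:          0
"""

cpus, table = parse_interrupts(SAMPLE)
print(cpus)
print(single_cpu_irqs(table))
```

On a live system the same functions could be fed with open("/proc/interrupts").read(), once with irqbalance running and once without, to compare how the disk controller interrupts are spread over the cores.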

---

NOW I THINK THE TIME HAS COME TO DO - EH - WHAT ?

lkml yields - nothing, except

- sata and device-driver people drowning in patchmania and quirks
- acpi struggling with buggy bioses

i just observe

- smp code for amd is so heavily broken that only a fallback to uniprocessor
gives satisfactory behaviour, which makes any investment in mp useless.

- no concise information on how to work around/troubleshoot, and no improvement
over the years

---

FINALLY

as step 1 (keeping "irqpoll/irqfixup" in mind :) i wanted to get rid of the APIC
timer to begin at the beginning (the tick problem)

to the best of my knowledge "acpi=off noapic nolapic notsc" should do the job,
but actually did not : the following lines show up in dmesg

  Using local APIC timer interrupts.
  result 12516111
  Detected 12.516 MHz APIC timer.
  time.c: Lost 11 timer tick(s)! rip setup_boot_APIC_clock+0x11f/0x121)

not really knowing the impact i tried "disable_timer_pin_1"
and got the attached var/log/messages log covering about 24 hrs, showing
increasing lost ticks and resulting in a clock skew of 5 hrs DESPITE ntpd
running !!!
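
To see where the ticks go over a long log, the "Lost N timer tick(s)" messages can be tallied per 'rip' symbol. This is a minimal sketch, assuming every message has the same shape as the dmesg line quoted in this report; the function name tally_lost_ticks is made up for illustration.

```python
import re
from collections import Counter

# Matches the message format quoted in this report, e.g.
#   time.c: Lost 11 timer tick(s)! rip setup_boot_APIC_clock+0x11f/0x121)
LOST_TICK_RE = re.compile(r"time\.c: Lost (\d+) timer tick\(s\)! rip (\S+)")


def tally_lost_ticks(lines):
    """Return (total lost ticks, Counter of lost ticks per rip symbol)."""
    per_symbol = Counter()
    for line in lines:
        match = LOST_TICK_RE.search(line)
        if match:
            # Drop the "+0xoffset/0xsize)" tail, keep only the symbol name.
            symbol = match.group(2).split("+")[0]
            per_symbol[symbol] += int(match.group(1))
    return sum(per_symbol.values()), per_symbol


# The single line below is the one quoted from dmesg in this report.
sample = ["time.c: Lost 11 timer tick(s)! rip setup_boot_APIC_clock+0x11f/0x121)"]
total, per_symbol = tally_lost_ticks(sample)
print(total, dict(per_symbol))
```

For a whole log, the function can be given an open file object, e.g. tally_lost_ticks(open("/var/log/messages", errors="replace")), and per_symbol.most_common() then lists the code paths that report the most lost ticks.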

--------------------------------------------------------------------------

PLEASE :

1) comment on the observed problem
2) give advice on how to hunt this down further
3) is any cleanup of the smp code expected to improve things ?

thank you

==========================================================================

ATTACHMENTS :

- var/log/messages for about 24 hrs showing increasing lost ticks

(more uploaded as additional comments)

==========================================================================

Description of problem:

yes, tried a bit

Version-Release number of selected component (if applicable):

kernel 2.6.17-1.2157_FC5
asus a8v se deluxe, amd x2 4400+

How reproducible: 

"you might get an option to sell your mother-in-law if it does not occur"

Steps to Reproduce:

1. boot kernel w/ given kernel params

2. load the system e.g. run 2 vmware guests (win2k3srv,winxp) and load some
additional windoze-clients on the srv

3. wait 30-60 minutes 
  
Actual results:

lost ticks and more

Expected results:

a stable reliable and WORKING system

Additional info:

see next comments

Comment 1 wolfgang pichler 2006-09-04 20:44:31 UTC
Created attachment 135512
var/log/messages

Comment 2 wolfgang pichler 2006-09-04 20:47:13 UTC
Created attachment 135513
dmesg until problem losing ticks

kernel /vmlinuz-2.6.17-1.2157_FC5 ro root=/dev/vg1/fc5_x86_64 rhgb quiet
report_lost_ticks=50 nmi_watchdog=2 acpi=off noapic nolapic notsc
disable_timer_pin_1
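
A quick sanity check is whether the running kernel actually received all of these options. The following is a minimal sketch, assuming a plain whitespace-separated command line with no quoted values; the helper names are made up for illustration. Note that an option appearing on the command line only shows that it was passed, not that the kernel acted on it.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value, or None for bare flags}."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


# Options from the kernel line in this comment.
WANTED = ["report_lost_ticks", "nmi_watchdog", "acpi", "noapic",
          "nolapic", "notsc", "disable_timer_pin_1"]


def missing_params(cmdline, wanted=WANTED):
    """Return the wanted option names that do not appear on the command line."""
    present = parse_cmdline(cmdline)
    return [name for name in wanted if name not in present]


# Command line as given in this comment (without the "kernel /vmlinuz..." prefix).
CMDLINE = ("ro root=/dev/vg1/fc5_x86_64 rhgb quiet report_lost_ticks=50 "
           "nmi_watchdog=2 acpi=off noapic nolapic notsc disable_timer_pin_1")

print(missing_params(CMDLINE))
```

On the booted system the same check can be run against open("/proc/cmdline").read() instead of the CMDLINE string.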

Comment 3 wolfgang pichler 2006-09-04 20:48:03 UTC
Created attachment 135514
var/log/messages

kernel /vmlinuz-2.6.17-1.2157_FC5 ro root=/dev/vg1/fc5_x86_64 rhgb quiet
report_lost_ticks=50 nmi_watchdog=2 acpi=off noapic nolapic notsc
disable_timer_pin_1

Comment 4 wolfgang pichler 2006-09-04 21:02:25 UTC
the dmesg attachment shows an example between boot and the start of losing ticks
(intended to show the first code paths that suffer)

Comment 5 wolfgang pichler 2006-09-04 21:03:33 UTC
man i hate it - sorry - my first bug report here :-)

Comment 6 wolfgang pichler 2006-09-04 22:33:16 UTC
FYI : i noticed bug 181310 before digging into the issue again, but i think this
bug (205147) is an extended variant of 181310, not a mere duplicate

---

i have 3 promise and 1 via controllers running :

PDC20718 (SATA 300 TX4) (rev 02)
PDC20518/PDC40518 (SATAII 150 TX4) (rev 02)
PDC20378 (FastTrak 378/SATA 378) (rev 02) -------- onboard unused
VIA VT6420 SATA RAID Controller (rev 80)  -------- onboard

the misrouted/lost interrupts ALSO show up on VIA; i cannot prove it, but i
think they show up there first

again : having to drop irqbalance seems to me a strong hint that all this stuff
is not just a sata_promise issue

if you like i can test some config-variants w/ kernel-sources to get at least
some light into this shady kernel corner ...

---

if you still think this bug is just a mere dup of 181310 please mark it so


Comment 7 wolfgang pichler 2006-09-05 01:52:27 UTC
after kernel-src-browsing & additional heuristics

"disable_timer_pin_1" is complete nonsense and is not supported by any __setup()
but seems to be some outdated feature still hanging around in the kernel docs

"noapictimer" seems better suited for experimenting - it starves the system
while booting :-)
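
Whether a boot option is registered anywhere in a given source tree can be checked mechanically. This is a minimal sketch, assuming the option is registered on a single line through __setup() or early_param(); options handled some other way (for example module parameters) would not be found. The function name is made up for illustration.

```python
import os
import re


def find_param_handlers(source_root, param):
    """Yield (path, line_no, line) for __setup()/early_param() lines naming param."""
    # Accept both "name" and "name=" as the registered string.
    pattern = re.compile(
        r'\b(?:__setup|early_param)\s*\(\s*"' + re.escape(param) + r'=?"')
    for dirpath, _dirs, files in os.walk(source_root):
        for name in files:
            if not name.endswith(".c"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as fh:
                for line_no, line in enumerate(fh, 1):
                    if pattern.search(line):
                        yield path, line_no, line.strip()
```

Calling list(find_param_handlers(tree, "noapictimer")), where tree is the path to an unpacked kernel source, lists every matching registration; an empty list suggests that the tree does not register the option through these two macros.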

"notsc" and "enable_8254_timer" seem to have the same effect regarding timers :
the (io-)apic seems to be used for timer ticks on x86_64 anyway - or am i
missing something (i am definitely no hw-expert :-)

thus far i have found no way to get rid of the 'rip' messages

regards w.

Comment 8 David Lawrence 2006-09-05 15:22:30 UTC
Reassigning to correct owner, kernel-maint.

Comment 9 Dave Jones 2006-10-16 19:12:00 UTC
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 10 wolfgang pichler 2006-10-17 12:29:02 UTC
ok, will now switch to 2.6.18-1.2200.

2.6.18-1.2189 results : no change :
- using a serial console left the system unbootable, so no detailed analysis was
possible

Comment 11 wolfgang pichler 2007-03-26 09:14:18 UTC
please close the bug - systems updated to fc6 - no use following up here
reworked timer code has been announced for 2.6.20

Comment 12 Peter van Egdom 2007-07-02 13:27:31 UTC
Closing bug. See comment #11. Timer code is revised in later kernels 
(2.6.20+).

