Bug 186856

Summary: System slows down & kjournald in D state
Product: [Fedora] Fedora
Reporter: Srihari Vijayaraghavan <noldoli>
Component: kernel
Assignee: Konrad Rzeszutek <konradr>
Status: CLOSED WONTFIX
QA Contact: Brian Brock <bbrock>
Severity: high
Docs Contact:
Priority: medium
Version: 5
CC: cschreib, sobhi, wtogami
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-04-23 17:52:26 UTC
Type: ---
Attachments (description, flags):
dmesg with D state processes (flags: none)
dmesg with Sysrq-(M|P|T) (flags: none)
DMI decode of IBM NetVista 8305-42A desktop (flags: none)

Description Srihari Vijayaraghavan 2006-03-27 03:31:14 UTC
Description of problem:
After a day or two of uptime, FC5 with kernel 2.6.15-1.2054_FC5 slows down so
much that the system is unusable. Many kjournald processes are observed to
be in D state.

Version-Release number of selected component (if applicable):
2.6.15-1.2054_FC5

How reproducible:
Easily. Always. After boot up wait for a day or two.

Steps to Reproduce:
1. Boot
2. Wait
3. The system stops responding

Actual results:
Things simply stop functioning. Simple operations like the login procedure, ls etc.
take a painfully long time to complete. X is unusable.

Expected results:
The system should not stop responding.

Additional info:
Refer to attachment dmesg.txt, which has the Alt+SysRq+(P|T) etc. output. (The SysRq+T
output may only be partially complete.) Observe a few kjournald (ext3 kernel
threads) in D state, which are the obvious culprit.
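
For anyone wanting to check for D state processes without waiting for a SysRq dump, here is a minimal sketch in Python. It assumes a Linux /proc filesystem and relies only on the documented layout of /proc/<pid>/stat (pid, command name in parentheses, then the state letter). The function names are illustrative and not part of any tool mentioned in this bug.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the text of one /proc/<pid>/stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' rather than on whitespace.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren].strip())
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for every process in uninterruptible sleep (state D)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited, or the entry could not be parsed
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

A process that shows up briefly in D state is normal during disk I/O; the symptom described here is threads such as kjournald staying in that state across repeated runs.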

Comment 1 Srihari Vijayaraghavan 2006-03-27 03:31:14 UTC
Created attachment 126781 [details]
dmesg with D state processes

Comment 2 Srihari Vijayaraghavan 2006-03-27 22:36:03 UTC
I stand corrected. It doesn't take a day or two to become unusable, just a few
hours. :-(

Also, it's an FC4-to-FC5 upgraded system. (IOW, all current ext3 file systems
were created by FC4.) Not sure if that's relevant, though.

I'll verify whether Linus's kernel exhibits this behaviour.

Thanks

Comment 3 Srihari Vijayaraghavan 2006-03-29 22:41:40 UTC
In fact, 2.6.15-1.2054_FC5 did recover from the kjournald D state processes,
albeit magically and only after a long wait. Kernel.org's 2.6.16.1, after a day of
uptime, hasn't exhibited these symptoms.

I've updated the system to the newly released 2.6.16-1.2080_FC5 today. Will
report in a few days how this one goes.

Finally, IMHO, it'd be a good idea if somebody could analyse those D state
processes & see if 2.6.16-1.2080_FC5 might suffer from the same problems.

Thanks

Comment 4 Srihari Vijayaraghavan 2006-03-30 23:20:00 UTC
Created attachment 127085 [details]
dmesg with Sysrq-(M|P|T)

Comment 5 Srihari Vijayaraghavan 2006-03-30 23:30:35 UTC
Ref to previous attachment (id=127085 'dmesg with Sysrq-(M|P|T)'), after a day
of uptime, 2.6.16-1.2080_FC5 suffers from the same problems: unusably slow, time
is inaccurate (10 hrs behind) etc.

A few user processes (auditd, syslogd, udevd etc.) & all kjournald, EXT3 kernel
threads, are in D state.

If & when it recovers on its own, I'll test 2.6.16.1 to isolate whether this is
an FC5-specific problem. If it isn't, then I'll escalate to LKML.

Thanks

PS: Because bugzilla ate my comments I typed before the attachment, I had to
retype it, & hence now it looks reversed: attachment first, comments next. :-(

Comment 6 Srihari Vijayaraghavan 2006-04-06 23:09:01 UTC
After a week of uptime with no problems (no slowness, no D state processes) on
kernel.org's 2.6.16.1, it's safe to conclude that this problem might be FC5
specific, assuming .config of kernel.org's kernel closely resembles that of FC5.

So, that's the next step: to have all kernel modules that FC5 loads configured in
kernel.org's kernel (which closely resembles FC5 anyway, so as to silence all the
noisy start-up scripts' annoying error/warning messages :-)). On request, my
.config shall be provided.

Thanks

Comment 7 Srihari Vijayaraghavan 2006-04-18 04:07:22 UTC
Actually, 2.6.16.1 with FC5's .config has the same problem. I've reported it to
LKML (http://marc.theaimsgroup.com/?l=linux-kernel&m=114531784901024&w=2).

Thanks

Comment 8 Srihari Vijayaraghavan 2006-05-08 00:27:15 UTC
It turns out CONFIG_X86_UP_IOAPIC is the source of the problem. Disabling it makes
the problem go away. A few IBM NetVista (P-IV) machines are exhibiting this symptom.
We are using the "noapic" kernel boot parameter to see if all is well.

Comment 9 Konrad Rzeszutek 2006-05-08 14:35:22 UTC
Srihari,

Regarding the IBM NetVista machines, have you checked whether there is an
updated BIOS for them?

Using the 'noapic' option is the correct mechanism for disabling IO APIC support on
some machines.
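
To confirm that a running system was actually booted with 'noapic', one can inspect /proc/cmdline. The following is a minimal sketch assuming a Linux system; the helper names are illustrative only.

```python
def has_boot_param(cmdline, name):
    """Return True if the kernel command line contains the given parameter."""
    # Match whole tokens only, so "nolapic" or "acpi=off" are not mistaken
    # for "noapic". Also accept the name=value form for other parameters.
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    print("noapic in effect:", has_boot_param(read_cmdline(), "noapic"))
```

Matching whole tokens matters here because several similarly named parameters exist (noapic, nolapic, acpi=...) and they control different things.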

Comment 10 Srihari Vijayaraghavan 2006-05-09 00:43:10 UTC
Thanks Konrad. Yes indeed, I'm using the most recent BIOS update available
(i.e., 5 Aug 2005).

Comment 11 Srihari Vijayaraghavan 2006-05-09 00:58:38 UTC
Other observations: Windows XP uses no APIC either, just XT-PIC (observed
through winmsd & device manager). Incidentally, Windows XP does use APIC on an
older NetVista (P-III) model, however.

A couple of logical conclusions I could draw (perhaps wrongly):
1. It seems XP knows not to use APIC on this model (P-IV), though it uses it on
the older model (P-III).
2. APIC (Linux or Windows) isn't a widely tested mode in this model. (Let's face
it, world domination is far far ahead yet :-)

Thanks

Comment 12 Konrad Rzeszutek 2006-05-09 14:37:46 UTC
(In reply to comment #11)
> Other observations: Windows XP uses no APIC either, just XT-PIC (observed
> through winmsd & device manager). Incidentally, Windows XP does use APIC on an
> older NetVista (P-III) model, however.
> 
> A couple of logical conclusions I could draw (perhaps wrongly):
> 1. It seems XP knows not to use APIC on this model (P-IV), though it uses it on
> the older model (P-III).

If you see XT-PIC, that means the kernel is using the legacy IRQ handler. The
legacy IRQ handler uses the 8259A controller, which is present in every PC. It
is limited to a small number of interrupts and cannot route IRQs to
different CPUs (which the IO APIC and Local APIC can).

> 2. APIC (Linux or Windows) isn't a widely tested mode in this model. (Let's face
> it, world domination is far far ahead yet :-)
> 

Part of this problem might be that your machine is hammered with timer
interrupts and spends most of its time handling those requests. 

If you have the time (and the desire), please post in this bug the contents of
/proc/interrupts with and without the "noapic" bootup parameter.
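
To make the two /proc/interrupts captures easier to compare, here is a minimal Python sketch that reduces each one to a total count and a description (controller type and device) per IRQ. It assumes the usual layout of a header row naming the CPUs followed by one row per interrupt; the function name is illustrative, and the exact description text varies between kernel versions.

```python
def parse_interrupts(text):
    """Map each IRQ label to (total count across CPUs, description text)."""
    lines = text.strip("\n").splitlines()
    # The header row has one column per CPU, e.g. "CPU0 CPU1".
    ncpu = len(lines[0].split())
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not an interrupt row
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        # Whatever follows the counts names the controller and the device,
        # for example "XT-PIC timer" or "IO-APIC-edge timer".
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (sum(counts), description)
    return table


if __name__ == "__main__":
    with open("/proc/interrupts") as f:
        for irq, (total, description) in parse_interrupts(f.read()).items():
            print(irq, total, description)
```

Running it once per boot (with and without "noapic") and comparing the row for IRQ 0, the timer, shows both which controller is in use and how fast the timer count grows relative to uptime.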

Also, the model you have is old, and its implementation of APIC in the BIOS,
along with how it is actually wired, might lead to the interrupt hammer problem.

The "fix" that I am working on putting in FC (and mainline) is to blacklist this
model so that the 'noapic' bootup option will always be chosen on this particular
machine. For that purpose I need the output of 'dmidecode'. If you could
provide the output of running that command, it would be much appreciated.

Thanks

Comment 13 Srihari Vijayaraghavan 2006-05-09 23:51:17 UTC
Completely agree that the timer interrupts are getting lost when APIC is used
(when the machine slows down, it's easily proven with sleep 1, which takes
around 50 seconds to return).

The dmidecode output is attached as "dmidecode-netvista-8305-42A.txt".

Thanks

PS: Interestingly, "dmidecode > foo.txt" doesn't work (through strace, the output
is proven to go to fd 1), while "dmidecode | cat > foo.txt" does. Strange.

Comment 14 Srihari Vijayaraghavan 2006-05-09 23:54:53 UTC
Created attachment 128826 [details]
DMI decode of IBM NetVista 8305-42A desktop

Comment 15 Konrad Rzeszutek 2006-07-13 14:23:01 UTC
Srihari,

Sorry for taking so long. Please try this kernel without the 'noapic' parameter.

http://darnok.com/kernels/kernel-2.6.15-1.2054_bz186856_FC5.brewbuilder.i686.rpm

Thanks.

Comment 16 Ali Sobhi 2006-09-07 03:19:28 UTC
Comment #8 refers to "noapic". I'm not familiar with it. Where can I find more
info regarding that? Does it have anything to do with ACPI?
I also have 5 IBM NetVista model 8305 machines which exhibit similar problems. I
have disabled APM (Advanced Power Management) from inside the BIOS. It has only
been 24 hours, and there have been no problems yet.
When the problem shows up, one can run multiple "date" commands and see the system
clock keep going back and forth.
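
The repeated "date" check can be automated. The sketch below, in Python, samples the wall clock and counts how often a reading is earlier than the previous one. It is only an illustration with made-up function names; note that a backward step can also come from a legitimate clock adjustment (for example by NTP), so a single step is not proof of this bug.

```python
import time


def count_backward_steps(samples):
    """Count how often a reading is earlier than the one just before it."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


def sample_wall_clock(n=100, interval=0.1, clock=time.time, sleep=time.sleep):
    """Collect n wall-clock readings, pausing between consecutive readings."""
    # clock and sleep are parameters so the function can be exercised with
    # a fake clock instead of waiting in real time.
    readings = []
    for _ in range(n):
        readings.append(clock())
        sleep(interval)
    return readings


if __name__ == "__main__":
    readings = sample_wall_clock()
    print("backward steps:", count_backward_steps(readings))
```

On a healthy system the count should stay at zero over many runs.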

Comment 17 Konrad Rzeszutek 2006-09-07 14:51:39 UTC
Ali,

'noapic' is an option that turns off use of the IO APIC for interrupts and
uses the legacy PIC controller instead.

Did you by chance use the test kernel I posted in comment #15?



Comment 18 Dave Jones 2006-10-17 00:08:10 UTC
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 19 Konrad Rzeszutek 2006-10-17 15:08:36 UTC
http://darnok.com/kernels/kernel-2.6.17-1.2174_bz196856_FC5.i686.rpm

Please test that kernel before moving to FC6. The message in comment #18 is a
canned one for all the bugs that have FC5 in them.

Just install the kernel using the rpm command. If it complains, do a --force.
The --force won't overwrite anything as the kernel you are installing is
different than the stock one. Please boot it up without the 'noapic' parameter
and see if it works and attach the 'dmesg' file to this BZ.

You can remove the kernel safely afterwards by doing 'rpm -e <kernel>'.

Thanks.

Comment 20 Ali Sobhi 2006-10-18 19:16:12 UTC
Just tried it (kernel in #19) on my NetVista 8305. No change. After one day of
operation, the system was noticeably slow and the system clock was almost 12
hours behind.
Reverting to the previous kernel and the use of "noapic" ;-)

Comment 21 Konrad Rzeszutek 2006-10-18 19:42:22 UTC
Ali,
Can you include the output of dmidecode in this BZ pls? Thank you.

Comment 22 Konrad Rzeszutek 2007-04-23 17:52:26 UTC
Closing BZ as WONTFIX.

The workaround is to supply "noapic" as a bootup argument until the BIOS team
comes up with a fix.