Red Hat Bugzilla – Bug 186856
System slows down & kjournald in D state
Last modified: 2007-11-30 17:11:28 EST
Description of problem:
After a day or two of uptime, FC5 with kernel 2.6.15-1.2054_FC5 slows down so
much that the system is unusable. Many kjournald processes are observed to
be in D state.
Version-Release number of selected component (if applicable):
How reproducible:
Easily. Always. After boot-up, wait for a day or two.
Steps to Reproduce:
3. Let's stop responding
Actual results:
Things simply stop functioning. Simple commands such as the login procedure, ls etc.
take a painfully long time to complete. X is unusable.
Expected results:
System not to stop responding.
Additional info:
Refer to the attachment dmesg.txt, which has the Alt+SysRq+(P|T) output. (The SysRq+T
output may only be partially complete.) Observe a few kjournald (ext3 kernel
threads) in D state, which is an obvious culprit.
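For anyone who wants to list the D state tasks without going through SysRq, here is a minimal sketch, assuming a Linux /proc filesystem where each /proc/<pid>/stat line starts with the pid, the command name in parentheses, and the one-letter state. The function names are illustrative only and do not come from the attachment.

```python
import os

def parse_stat_line(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state

def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for every task currently in uninterruptible sleep."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError):
            continue  # the process exited while we were reading it
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling d_state_processes() every few seconds and printing the result would show whether the same kjournald threads stay stuck or whether tasks only pass through D state briefly.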
Created attachment 126781 [details]
dmesg with D state processes
I stand corrected. It doesn't take a day or two to become unusable, just a few
Also, it's an FC4-to-FC5 upgraded system. (IOW, all current ext3 file systems
were created under FC4.) Not sure whether that is relevant, though.
I'll verify whether Linus's kernel exhibits this behaviour.
In fact, 2.6.15-1.2054_FC5 did recover from the kjournald D state processes,
albeit magically and only after a long wait. Kernel.org's 188.8.131.52, after a day of
uptime, hasn't exhibited these symptoms.
I've updated the system to newly updated/released 2.6.16-1.2080_FC5 today. Will
report in a few days how this one goes.
Finally, IMHO, it'd be a good idea if somebody could analyse those D state
processes & see if 2.6.16-1.2080_FC5 might suffer from the same problems.
Created attachment 127085 [details]
dmesg with Sysrq-(M|P|T)
Ref to previous attachment (id=127085 'dmesg with Sysrq-(M|P|T)'), after a day
of uptime, 2.6.16-1.2080_FC5 suffers from the same problems: unusably slow, time
is inaccurate (10 hrs behind) etc.
A few user processes (auditd, syslogd, udevd etc.) & all kjournald, EXT3 kernel
threads, are in D state.
If & when it recovers on its own, I'll test 184.108.40.206 to isolate whether this is an
FC5-specific problem. If it isn't, then I'll escalate to LKML.
PS: Because bugzilla ate the comments I typed before the attachment, I had to
retype them, & hence it now looks reversed: attachment first, comments next. :-(
After a week of uptime with no problems (no slowness, no D state processes) on
kernel.org's 220.127.116.11, it's safe to conclude that this problem might be FC5
specific, assuming .config of kernel.org's kernel closely resembles that of FC5.
So, that's the next step: to have all kernel modules FC5 loads configured in the
kernel.org's kernel (which closely resembles FC5 anyway, so as to silence all
noisy start-up scripts' annoying error/warning messages :-)). On request, my
.config shall be provided.
Actually, 18.104.22.168 with FC5's .config has the same problem. I've reported it to
It turns out CONFIG_X86_UP_IOAPIC is the source of the problem. Disabling it makes the
problem go away. A few IBM NetVista (P-IV) machines are exhibiting this symptom.
We are using "noapic" kernel boot parameter to see if all is well.
In regards to the IBM NetVista machines, have you looked to see if there is an
updated BIOS for them?
Using 'noapic' option is the correct mechanism for disabling IO APIC support on
Thanks Konrad. Yes indeed, I'm using the most recent BIOS update available
(i.e., 5 Aug 2005).
Other observations: Windows XP uses no APIC either, just XT-PIC (observed
through winmsd & device manager). Incidentally, Windows XP does use APIC on an
older NetVista (P-III) model, however.
A couple of logical conclusions I could draw (perhaps wrongly):
1. It'd seem XP knows not to use APIC on this model (P-IV), though it uses it on
the older model (P-III).
2. APIC (Linux or Windows) isn't a widely tested mode in this model. (Let's face
it, world domination is far far ahead yet :-)
(In reply to comment #11)
> Other observations: Windows XP uses no APIC either, just XT-PIC (observed
> through winmsd & device manager). Incidentally, Windows XP does use APIC on an
> older NetVista (P-III) model, however.
> A couple of logical conclusions I could draw (perhaps wrongly):
> 1. It'd seem XP knows not to use APIC on this model (P-IV), though it uses it on
> the older model (P-III).
If you see XT-PIC, that means the kernel is using the legacy IRQ handler. The
legacy IRQ handler uses the 8259A interrupt controller, which is present in
every PC. It is limited to a small number of interrupts and cannot route IRQs to
different CPUs (which the IO APIC and Local APIC can).
> 2. APIC (Linux or Windows) isn't a widely tested mode in this model. (Let's face
> it, world domination is far far ahead yet :-)
Part of this problem might be that your machine is hammered with timer
interrupts and spends most of its time handling those requests.
If you have the time (and the desire), please post in this bug the contents of
/proc/interrupts, captured both with and without the "noapic" boot parameter.
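To make the two captures easier to compare, a rough sketch along these lines could turn each /proc/interrupts snapshot into per-IRQ totals and then into interrupts per second. It assumes the usual layout (a header row of CPU names, then "label: count(s) description" rows); the function names are made up for illustration.

```python
def parse_interrupts(text):
    """Map each IRQ label to (total count across CPUs, description)."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        if ":" not in line:
            continue
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # The first ncpu fields are per-CPU counters; rows such as NMI or
        # ERR may carry fewer columns, so stop at the first non-number.
        for field in fields[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        table[label.strip()] = (sum(counts), " ".join(fields[len(counts):]))
    return table

def interrupt_rates(first, second, seconds):
    """Interrupts per second for each IRQ present in both snapshots."""
    return {
        irq: (second[irq][0] - first[irq][0]) / seconds
        for irq in first
        if irq in second
    }
```

Taking two snapshots a known number of seconds apart and comparing the timer row (IRQ 0) between an APIC boot and a "noapic" boot would show whether the timer rate differs between the two modes. Note that the interval should be measured against an external clock if the system clock itself is suspect.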
Also, the model you have is old, and its implementation of APIC in the BIOS,
along with how it is actually wired, might lead to the interrupt hammer problem.
The "fix" that I am working on getting into FC (and mainline) is to blacklist this
model so that the 'noapic' boot option will always be chosen on this particular
machine. For that purpose I would need the output of 'dmidecode'. If you could
provide the output of running that command, it would be much appreciated.
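To illustrate the idea of a DMI-based blacklist (this is only a user-space sketch, not the kernel's actual mechanism), the following pulls the "System Information" fields out of dmidecode's text output and checks them against a list of entries. It assumes the usual dmidecode layout of a section title followed by indented "Key: Value" lines with a blank line between records; substring matching and all names here are assumptions made for the example.

```python
def system_information(dmidecode_text):
    """Return the key/value pairs of the 'System Information' section."""
    info = {}
    in_section = False
    for line in dmidecode_text.splitlines():
        if not line.strip():
            in_section = False  # a blank line ends each DMI record
        elif line.strip() == "System Information":
            in_section = True
        elif in_section and ":" in line:
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
    return info

def is_blacklisted(info, blacklist):
    """True if any blacklist entry has all of its substrings present."""
    # Each entry maps a field name to a substring that must appear in it
    # (substring matching is an assumption of this sketch).
    return any(
        all(needle in info.get(field, "") for field, needle in entry.items())
        for entry in blacklist
    )
```

An entry would look like {"Manufacturer": "...", "Product Name": "..."}, filled in from the real dmidecode output of the affected machine.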
Completely agree that the timer interrupts are getting lost when APIC is used
(when the machine slows down, it's easily proven with sleep 1, which takes
around 50 seconds to return).
The dmidecode output is attached as "dmidecode-netvista-8305-42A.txt".
PS: Interestingly, "dmidecode > foo.txt" doesn't work (through strace the output is
proven to go to fd 1), while "dmidecode | cat > foo.txt" does. Strange.
Created attachment 128826 [details]
DMI decode of IBM NetVista 8305-42A desktop
Sorry for taking so long. Please try this kernel without the 'noapic' parameter.
Comment #8 refers to "noapic". I'm not familiar with it. Where can I find more
info regarding that? Does it have anything to do with ACPI?
I also have 5 IBM NetVista model 8305 machines which exhibit similar problems. I
have disabled APM (Advanced Power Management) from inside the BIOS. It has only
been 24 hours and no problems yet.
When the problem shows up, one can run the "date" command several times and see
the system clock keep going back and forth.
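The repeated "date" check can be automated with something like the sketch below, which samples the wall clock and reports every reading that is earlier than the previous one. The names are illustrative. On an affected machine the pause between samples may itself run much longer than requested, so the number of backward steps is more meaningful than the elapsed time.

```python
import time

def backward_steps(samples):
    """Return (index, step) for each sample earlier than the one before it."""
    steps = []
    for i in range(1, len(samples)):
        step = samples[i] - samples[i - 1]
        if step < 0:
            steps.append((i, step))
    return steps

def sample_wall_clock(count=10, pause=0.5):
    """Read the wall clock `count` times, pausing `pause` seconds between reads."""
    samples = []
    for _ in range(count):
        samples.append(time.time())
        time.sleep(pause)
    return samples
```

Running backward_steps(sample_wall_clock()) on a healthy system should return an empty list; any entries indicate the clock jumped backwards between two consecutive reads.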
The 'noapic' option turns off use of the IO APIC for interrupts and
uses the legacy PIC controller instead.
Did you by chance use the test kernel I posted in comment #15?
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.
Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.
This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.
Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.
In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed. See bug 207474 for further details.
If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.
If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.
Please test that kernel before moving to FC6. The message in comment #18 is a
canned one for all the bugs that have FC5 in them.
Just install the kernel using the rpm command. If it complains, do a --force.
The --force won't overwrite anything as the kernel you are installing is
different than the stock one. Please boot it up without the 'noapic' parameter
and see if it works and attach the 'dmesg' file to this BZ.
You can remove the kernel safely afterwards by doing 'rpm -e <kernel>'.
Just tried it (kernel in #19) on my NetVista 8305. No change. After one day of
operation, the system was noticeably slow and the system clock was almost 12
Reverting back to the previous kernel and use of "noapic" ;-)
Can you include the output of dmidecode in this BZ pls? Thank you.
Closing BZ as WONTFIX.
The workaround is to supply "noapic" as a bootup argument until the BIOS team
comes up with a fix.
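To confirm that a machine really booted with the workaround in place, a small check of the kernel command line is enough. This is a sketch that assumes /proc/cmdline is available; the helper names are made up for the example.

```python
def has_boot_param(cmdline, name):
    """True if `name` appears as a whole word on the kernel command line."""
    # Whole-word matching keeps "noapic" from matching "nolapic" or similar.
    return name in cmdline.split()

def running_kernel_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()
```

For example, has_boot_param(running_kernel_cmdline(), "noapic") should be True on a system booted with the workaround.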