Bug 186856
Summary: | System slows down & kjournald in D state | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | Srihari Vijayaraghavan <noldoli>
Component: | kernel | Assignee: | Konrad Rzeszutek <konradr>
Status: | CLOSED WONTFIX | QA Contact: | Brian Brock <bbrock>
Severity: | high | Priority: | medium
Version: | 5 | CC: | cschreib, sobhi, wtogami
Hardware: | i686 | OS: | Linux
Doc Type: | Bug Fix | Last Closed: | 2007-04-23 17:52:26 UTC

Description
Srihari Vijayaraghavan
2006-03-27 03:31:14 UTC
Created attachment 126781
dmesg with D state processes
I stand corrected. It doesn't take a day or two to become unusable, just a few hours. :-( Also, it's an FC4-to-FC5 upgraded system. (IOW, all current ext3 file systems were created under FC4.) Not sure if that's relevant, though. I'll verify whether Linus's kernel exhibits this behaviour.
Thanks

In fact, 2.6.15-1.2054_FC5 did recover from the kjournald D state processes, albeit magically, after a long wait. Kernel.org's 2.6.16.1, after a day of uptime, hasn't exhibited these symptoms. I've updated the system to the newly released 2.6.16-1.2080_FC5 today. Will report in a few days how this one goes. Finally, IMHO, it'd be a good idea if somebody could analyse those D state processes & see if 2.6.16-1.2080_FC5 might suffer from the same problems.
Thanks

Created attachment 127085
dmesg with Sysrq-(M|P|T)
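
For anyone wanting to watch for the D state processes mentioned above without taking a full Sysrq-T dump, here is a minimal sketch that scans `/proc/<pid>/stat` and reports tasks whose state field is `D` (uninterruptible sleep). The function names `parse_stat` and `d_state_processes` are made up for this illustration and are not part of any tool used in this report; the sketch assumes a Linux `/proc` file system.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left])
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state


def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for every task currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling `d_state_processes()` every few seconds and logging the result would show whether kjournald stays stuck or only passes through D state briefly, which is normal during disk I/O.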
Ref. previous attachment (id=127085, 'dmesg with Sysrq-(M|P|T)'): after a day of uptime, 2.6.16-1.2080_FC5 suffers from the same problems: unusably slow, time is inaccurate (10 hrs behind) etc. A few user processes (auditd, syslogd, udevd etc.) & all kjournald (EXT3 kernel threads) are in D state. If & when it recovers on its own, I'll test 2.6.16.1 to isolate whether this is an FC5-specific problem. If it isn't, then I'll escalate to LKML.
Thanks
PS: Because bugzilla ate the comments I typed before the attachment, I had to retype them, & hence it now looks reversed: attachment first, comments next. :-(

After a week of uptime with no problems (no slowness, no D state processes) on kernel.org's 2.6.16.1, it's safe to conclude that this problem might be FC5 specific, assuming the .config of kernel.org's kernel closely resembles that of FC5. So, that's the next step: to have all kernel modules FC5 loads configured in the kernel.org kernel (which closely resembles FC5 anyway, so as to silence all the noisy start-up scripts' annoying error/warning messages :-)). On request, my .config shall be provided.
Thanks

Actually, 2.6.16.1 with FC5's .config has the same problem. I've reported it to LKML (http://marc.theaimsgroup.com/?l=linux-kernel&m=114531784901024&w=2).
Thanks

It turns out CONFIG_X86_UP_IOAPIC is the source of the problem. Disabling it makes the problem go away.

A few IBM NetVista (P-IV) machines are exhibiting this symptom. We are using the "noapic" kernel boot parameter to see if all is well.

Srihari, In regards to the IBM NetVista machines, have you looked to see if there is an updated BIOS for them?

Using the 'noapic' option is the correct mechanism for disabling IO APIC support on some machines.

Thanks Konrad. Yes indeed, I'm using the most recent BIOS update available (i.e., 5 Aug 2005).
Other observations: Windows XP uses no APIC either, just XT-PIC (observed through winmsd & device manager). Incidentally, Windows XP does use APIC on an older NetVista (P-III) model however.
A couple of logical conclusions I could make (perhaps wrongly):
1. It'd seem XP knows not to use APIC on this model (P-IV), though it uses it on the older model (P-III).
2. APIC (Linux or Windows) isn't a widely tested mode on this model. (Let's face it, world domination is far far ahead yet :-)
Thanks

(In reply to comment #11)
> Other observations: Windows XP uses no APIC either, just XT-PIC (observed
> through winmsd & device manager). Incidentally, Windows XP does use APIC on an
> older NetVista (P-III) model however.
>
> A couple of logical conclusions I could make (perhaps wrongly):
> 1. It'd seem XP knows not to use APIC on this model (P-IV), though it uses it on
> the older model (P-III).

If you see XT-PIC, that means the kernel is using the legacy IRQ handler. The legacy IRQ handler uses the 8259A interrupt controller, which is present in every PC; it is limited to a small number of interrupts and cannot route IRQs to different CPUs (which the IO APIC and Local APIC allow).

> 2. APIC (Linux or Windows) isn't a widely tested mode on this model. (Let's face
> it, world domination is far far ahead yet :-)

Part of this problem might be that your machine is hammered with timer interrupts and spends most of its time handling those requests. If you have the time (and the desire), please post in this bug the contents of /proc/interrupts with and without the "noapic" boot parameter.
Also, the model you have is old, and its implementation of APIC in the BIOS, along with how it is actually wired, might lead to the interrupt hammer problem. The "fix" that I am working on putting in FC (and mainline) is to blacklist this model so that the 'noapic' boot option will always be chosen on this particular machine. For that purpose I would need the output of 'dmidecode'. If you could provide the output of running that command it would be much appreciated.
Thanks

Completely agree that the timer interrupts are getting lost when APIC is used (when the machine slows down, it's easily proven with "sleep 1", which takes around 50 seconds to return). The dmidecode output is attached as "dmidecode-netvista-8305-42A.txt".
Thanks
PS: Interestingly, "dmidecode > foo.txt" doesn't work (strace shows the output does go to fd 1), while "dmidecode | cat > foo.txt" does. Strange.

Created attachment 128826
DMI decode of IBM NetVista 8305-42A desktop
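
The /proc/interrupts comparison requested above (with and without "noapic") can be made a little less error-prone with a small parser. The following is only a sketch under assumptions: the helper names `parse_interrupts` and `timer_info` are invented for this example, and it assumes the usual layout of a CPU header line followed by `label: count(s) description` rows.

```python
def parse_interrupts(text):
    """Map each IRQ label to (total count across CPUs, description)."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    rows = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Consume up to ncpu leading numeric fields as per-CPU counts.
        i = 0
        while i < len(fields) and i < ncpu and fields[i].isdigit():
            i += 1
        total = sum(int(f) for f in fields[:i])
        rows[label.strip()] = (total, " ".join(fields[i:]))
    return rows


def timer_info(text):
    """Return (count, description) for IRQ 0, the timer on these machines."""
    return parse_interrupts(text)["0"]


def read_proc_interrupts(path="/proc/interrupts"):
    """Read the live table; Linux only."""
    with open(path) as fh:
        return fh.read()
```

The description returned by `timer_info(read_proc_interrupts())` shows which controller is handling the timer (for example XT-PIC when booted with "noapic"), and two snapshots taken a known interval apart give the timer interrupt rate. Since the system clock itself drifts on the affected machines, that interval is best measured against an external clock.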
Srihari, Sorry for taking so long. Please try this kernel without the 'noapic' parameter. http://darnok.com/kernels/kernel-2.6.15-1.2054_bz186856_FC5.brewbuilder.i686.rpm
Thanks.

Comment #8 refers to "noapic". I'm not familiar with it. Where can I find more info regarding that? Does it have anything to do with ACPI? I also have 5 IBM NetVista model 8305 machines which are exhibiting similar problems. I have disabled APM (Advanced Power Management) from inside the BIOS. It has only been 24 hours and no problems yet. When the problem shows up, one can run the "date" command several times and see the system clock keep going back and forth.

Ali, 'noapic' is an option to turn off using the IO APIC for interrupts and use the legacy PIC controller instead. Did you by chance use the test kernel I posted in comment #15?

A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem.
This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed.
Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug.
In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem please check you only have one version of device-mapper & lvm2 installed. See bug 207474 for further details.
If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613.
If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem.
Thank you.

http://darnok.com/kernels/kernel-2.6.17-1.2174_bz196856_FC5.i686.rpm
Please test that kernel before moving to FC6. The message in comment #18 is a canned one for all the bugs that have FC5 in them. Just install the kernel using the rpm command. If it complains, use --force. The --force won't overwrite anything, as the kernel you are installing is different from the stock one. Please boot it without the 'noapic' parameter, see if it works, and attach the 'dmesg' output to this BZ. You can remove the kernel safely afterwards by doing 'rpm -e <kernel>'.
Thanks.

Just tried it (kernel in #19) on my NetVista 8305. No change. After one day of operation, the system was noticeably slow and the system clock was almost 12 hours behind. Reverting to the previous kernel and the use of "noapic" ;-)

Ali, Can you include the output of dmidecode in this BZ pls? Thank you.

Closing BZ as WONTFIX. The workaround is to supply "noapic" as a boot argument until the BIOS team comes up with a fix.
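
The symptom reported earlier in this thread (running "date" repeatedly and seeing the clock go back and forth) can be checked for automatically. This is a rough sketch only, with invented helper names (`backward_steps`, `sample_wall_clock`); it simply samples the wall clock and flags any reading that is earlier than the one before it.

```python
import time


def backward_steps(samples):
    """Return (index, previous, current) for each sample earlier than its predecessor."""
    steps = []
    for i in range(1, len(samples)):
        if samples[i] < samples[i - 1]:
            steps.append((i, samples[i - 1], samples[i]))
    return steps


def sample_wall_clock(count=20, interval=0.5):
    """Collect wall-clock readings, pausing between them."""
    samples = []
    for _ in range(count):
        samples.append(time.time())
        time.sleep(interval)
    return samples
```

Running `backward_steps(sample_wall_clock())` on a healthy machine should normally return an empty list. A deliberate clock adjustment (NTP stepping the time, for instance) can also produce a backward step, so a single hit is not proof of the IO APIC problem.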