Bug 477937 - Tasks getting stuck in TASK_UNINTERRUPTIBLE on all 2.6.27 kernels
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 9
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
 
Reported: 2008-12-25 21:33 UTC by Sam Varshavchik
Modified: 2009-02-16 13:55 UTC (History)

Doc Type: Bug Fix
Last Closed: 2009-02-16 13:55:29 UTC

Description Sam Varshavchik 2008-12-25 21:33:40 UTC
One of my dual quad-core x86_64 servers reproducibly breaks under load with every 2.6.27 kernel released for F9 so far. The last kernel this server runs reliably on is kernel-2.6.26.6-79.fc9.x86_64. This keeps me from upgrading to F10.

Symptoms:

The kernel boots normally and runs fine as long as it's not under load. After several minutes of a heavy compile-link duty cycle, processes start to hang. The box responds to pings, but stops accepting new ssh connections. If a login session is active on the system console, bash reissues a prompt in response to empty commands, but attempting to run any command hangs the console and leaves it unresponsive.

The load is disk I/O load -- this box has 4GB of RAM, and there are usually a couple of gigs free when it hangs. The SATA controller uses the sata_nv.ko driver, with a pair of SATA disks in a RAID-1 configuration.

On one occasion, when the system ended up in a partially frozen state, I was able to run 'ps' and see a lot of gcc processes stuck in "wait" state.

I managed to set up kexec and, after triggering this reproducible hang, successfully captured a 4GB kdump with alt-sysrq-c. This is what crash's "ps -u" comes back with:

...

  11023   8863   3  ffff88010f845b80  IN   0.1   87672   2464  sh
  11138  11002   2  ffff8801190cadc0  UN   0.0    3924    516  gcc
  11182   8863   6  ffff880116db96e0  IN   0.1   87672   2460  sh
  11208   8863   4  ffff88010f920000  IN   0.1   87668   2436  sh
  11262  10652   2  ffff880073ce8000  UN   0.0    3924    512  gcc
  11268  11023   2  ffff880073ce2dc0  UN   0.0    3924    508  gcc
  11287  10668   2  ffff8800708f96e0  UN   0.0    3924    508  gcc
  11330  11182   4  ffff88010f9c0000  IN   0.0    3924    480  gcc
  11333  11330   4  ffff88010f46db80  IN   0.0    4100    600  gcc
  11336  11333   1  ffff88010f468000  RU   0.0   18028   1532  cc1
  11347  11208   1  ffff88010f8416e0  RU   0.0   87668   1436  sh

...

Looks like a whole bunch of processes are in TASK_UNINTERRUPTIBLE (the UN state above).

  14058   3262   2  ffff8800708fadc0  UN   0.0  128956   1472  crond

crond is nailed too :-)
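
For anyone who wants to tally the listing instead of eyeballing it, here is a rough Python sketch that counts tasks per state and picks out the UN ones. It assumes the columns are PID, PPID, CPU, TASK, ST, %MEM, VSZ, RSS, COMM (the header line is not in the excerpt above, so that layout is an assumption), and that crash may prefix active tasks with ">". The names parse_ps, tally_states and stuck_tasks are made up for this illustration, and SAMPLE is just a few rows copied from the output above.

```python
from collections import Counter

# A few rows copied from the "ps -u" excerpt in this report.
# Assumed column order: PID PPID CPU TASK ST %MEM VSZ RSS COMM
SAMPLE = """\
  11023   8863   3  ffff88010f845b80  IN   0.1   87672   2464  sh
  11138  11002   2  ffff8801190cadc0  UN   0.0    3924    516  gcc
  11336  11333   1  ffff88010f468000  RU   0.0   18028   1532  cc1
  14058   3262   2  ffff8800708fadc0  UN   0.0  128956   1472  crond
"""


def parse_ps(text):
    """Turn crash-style ps rows into dicts, skipping anything that doesn't fit."""
    rows = []
    for line in text.splitlines():
        # Assumption: an active task may be marked with a leading ">".
        fields = line.lstrip("> \t").split(None, 8)
        if len(fields) != 9 or not fields[0].isdigit():
            continue  # header, "..." separator, or blank line
        rows.append({
            "pid": int(fields[0]),
            "ppid": int(fields[1]),
            "state": fields[4],
            "comm": fields[8].strip(),
        })
    return rows


def tally_states(rows):
    """Count tasks per state code (IN, UN, RU, ...)."""
    return Counter(row["state"] for row in rows)


def stuck_tasks(rows):
    """Return (pid, comm) for every task in the UN (uninterruptible) state."""
    return [(row["pid"], row["comm"]) for row in rows if row["state"] == "UN"]


if __name__ == "__main__":
    rows = parse_ps(SAMPLE)
    print(dict(tally_states(rows)))
    for pid, comm in stuck_tasks(rows):
        print(pid, comm)
```

To run it against the full listing, save the crash output to a file and pass its contents to parse_ps in place of SAMPLE.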

I can make the 4GB vmcore (from 2.6.27.7-53.fc9.x86_64) available for download somewhere (my upstream bandwidth is 1mb/s), or I can run anything else in crash, and respond with the results. I don't really know much about kernel debugging -- just enough to run crash and type commands.

I can also generate more vmcore dumps, if this one has nothing useful.

Comment 1 Sam Varshavchik 2009-02-16 13:55:29 UTC
This bug can no longer be reproduced in 2.6.27.12; presumably it has been fixed.