477937 – Tasks getting stuck in TASK_UNINTERRUPTIBLE on all 2.6.27 kernels

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	9
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-12-25 21:33 UTC by Sam Varshavchik
Modified:	2009-02-16 13:55 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-02-16 13:55:29 UTC
Type:	---
Embargoed:
Dependent Products:

Description Sam Varshavchik 2008-12-25 21:33:40 UTC

One of my dual quad-core x86_64 servers reproducibly breaks under load with every 2.6.27 kernel released for F9 so far. The last kernel on which this server runs reliably is kernel-2.6.26.6-79.fc9.x86_64. This keeps me from upgrading to F10.

Symptoms:

The kernel boots normally and runs fine as long as it's not under load. After several minutes of a heavy compile-link duty cycle, processes start to hang. The box still responds to pings, but stops accepting new ssh connections. If a login session is active on the system console, bash reissues a prompt in response to empty commands, but attempting to run any command hangs the console and leaves it unresponsive.

The load is disk I/O rather than memory pressure -- this box has 4GB of RAM, and there's usually a couple of gigs free when it hangs. The SATA controller uses the sata_nv driver (sata_nv.ko), with a pair of SATA disks in a RAID-1 configuration.

On one occasion the system ended up in a partially-frozen state, and I was able to run 'ps' and see a lot of gcc processes stuck in "wait" state.
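For anyone trying to catch a similar partial freeze on a live system, here is a minimal sketch (not something that was run on this box) that lists tasks in uninterruptible sleep by reading /proc/<pid>/stat. The function names are made up for illustration. It only looks at thread group leaders, not individual threads. Since new commands hang once the problem sets in, it would probably have to be started before the hang.

```python
import os

def parse_stat(stat_line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left])
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state

def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for tasks whose state is 'D' (uninterruptible sleep)."""
    stuck = []
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the entry could not be parsed
        if state == "D":
            stuck.append((pid, comm))
    return stuck

if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

The 'D' state reported in /proc corresponds to what crash shows as UN.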

I managed to set up kexec and, after triggering this reproducible hang, successfully captured a 4GB kdump with alt-sysrq-c. This is what crash's "ps -u" comes back with:

...

  11023   8863   3  ffff88010f845b80  IN   0.1   87672   2464  sh
  11138  11002   2  ffff8801190cadc0  UN   0.0    3924    516  gcc
  11182   8863   6  ffff880116db96e0  IN   0.1   87672   2460  sh
  11208   8863   4  ffff88010f920000  IN   0.1   87668   2436  sh
  11262  10652   2  ffff880073ce8000  UN   0.0    3924    512  gcc
  11268  11023   2  ffff880073ce2dc0  UN   0.0    3924    508  gcc
  11287  10668   2  ffff8800708f96e0  UN   0.0    3924    508  gcc
  11330  11182   4  ffff88010f9c0000  IN   0.0    3924    480  gcc
  11333  11330   4  ffff88010f46db80  IN   0.0    4100    600  gcc
  11336  11333   1  ffff88010f468000  RU   0.0   18028   1532  cc1
  11347  11208   1  ffff88010f8416e0  RU   0.0   87668   1436  sh

...

Looks like a whole bunch of processes in TASK_UNINTERRUPTIBLE (the "UN" state).

  14058   3262   2  ffff8800708fadc0  UN   0.0  128956   1472  crond

crond is nailed too :-)
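To summarize a long listing like this, a small sketch along the following lines could tally the tasks by state. It assumes the column order PID, PPID, CPU, TASK, ST, %MEM, VSZ, RSS, COMM, which matches the rows above but is not shown in the excerpt (the header line was cut off), and that a leading ">" may mark a task active on a CPU. The function name and the SAMPLE variable are made up for illustration; the sample rows are copied from the listing above.

```python
from collections import Counter

def tally_states(ps_text):
    """Count tasks per state (ST column) in pasted crash "ps" output."""
    counts = Counter()
    for line in ps_text.splitlines():
        # Assumption: a leading ">" flags the task currently on a CPU.
        fields = line.lstrip().lstrip(">").split()
        # Assumed columns: PID PPID CPU TASK ST %MEM VSZ RSS COMM
        if len(fields) < 9 or not fields[0].isdigit():
            continue  # skip "...", blank lines, and any header
        counts[fields[4]] += 1
    return counts

# A few rows copied from the listing above.
SAMPLE = """
  11138  11002   2  ffff8801190cadc0  UN   0.0    3924    516  gcc
  11330  11182   4  ffff88010f9c0000  IN   0.0    3924    480  gcc
  11336  11333   1  ffff88010f468000  RU   0.0   18028   1532  cc1
  14058   3262   2  ffff8800708fadc0  UN   0.0  128956   1472  crond
"""

if __name__ == "__main__":
    for state, count in sorted(tally_states(SAMPLE).items()):
        print(state, count)
```

Saving the full "ps" output to a file and passing its contents to tally_states() would show how many tasks are in UN compared with IN and RU.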

I can make the 4GB vmcore (from 2.6.27.7-53.fc9.x86_64) available for download somewhere (my upstream bandwidth is 1mb/s), or I can run anything else in crash, and respond with the results. I don't really know much about kernel debugging -- just enough to run crash and type commands.

I can also generate more vmcore dumps, if this one has nothing useful.

Comment 1 Sam Varshavchik 2009-02-16 13:55:29 UTC

This bug can no longer be reproduced in 2.6.27.12; presumably it has been fixed.
