Bug 678821 - processes being blocked by kernel; machine eventually becomes unresponsive
Summary: processes being blocked by kernel; machine eventually becomes unresponsive
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Prarit Bhargava
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-02-19 22:54 UTC by Geoff Quelch
Modified: 2011-03-24 14:16 UTC (History)
1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-03-24 14:16:31 UTC
Target Upstream Version:


Attachments

Description Geoff Quelch 2011-02-19 22:54:30 UTC
Description of problem: Receiving this in the log file:
Feb 15 08:08:47 npws01 kernel: INFO: task oracle:2517 blocked for more than 120 seconds.
Feb 15 08:08:47 npws01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 15 08:08:47 npws01 kernel: oracle        D ffff81000900caa0     0  2517   2508                2516 (NOTLB)
Feb 15 08:08:47 npws01 kernel:  ffff810135b5da58 0000000000000082 0000000000000000 ffff810438833d78
Feb 15 08:08:47 npws01 kernel:  ffff810438833c00 0000000000000009 ffff81043fff5860 ffff81010ef0e100
Feb 15 08:08:47 npws01 kernel:  00003b404f3288a6 0000000000044583 ffff81043fff5a48 0000000138833c00
Feb 15 08:08:47 npws01 kernel: Call Trace:

Feb 15 08:08:47 npws01 kernel:  [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90
Feb 15 08:08:47 npws01 kernel:  [<ffffffff80028b0b>] sync_page+0x0/0x43
Feb 15 08:08:47 npws01 kernel:  [<ffffffff800637ca>] io_schedule+0x3f/0x67
Feb 15 08:08:47 npws01 kernel:  [<ffffffff80028b49>] sync_page+0x3e/0x43
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8006390e>] __wait_on_bit_lock+0x36/0x66
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8003fdc1>] __lock_page+0x5e/0x64
Feb 15 08:08:47 npws01 kernel:  [<ffffffff800a28e2>] wake_bit_function+0x0/0x23
Feb 15 08:08:47 npws01 kernel:  [<ffffffff80013b22>] find_lock_page+0x69/0xa2
Feb 15 08:08:47 npws01 kernel:  [<ffffffff800c805f>] grab_cache_page_write_begin+0x2c/0x89
Feb 15 08:08:47 npws01 kernel:  [<ffffffff88665125>] :nfs:nfs_write_begin+0x41/0xf8
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8000fda9>] generic_file_buffered_write+0x14b/0x675
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8003fdc1>] __lock_page+0x5e/0x64
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8001669b>] __generic_file_aio_write_nolock+0x369/0x3b6
Feb 15 08:08:47 npws01 kernel:  [<ffffffff80021872>] generic_file_aio_write+0x65/0xc1
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8866584d>] :nfs:nfs_file_write+0xd8/0x14f
Feb 15 08:08:47 npws01 kernel:  [<ffffffff80018301>] do_sync_write+0xc7/0x104
Feb 15 08:08:47 npws01 kernel:  [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
Feb 15 08:08:47 npws01 kernel:  [<ffffffff80016aa3>] vfs_write+0xce/0x174
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8004400e>] sys_pwrite64+0x50/0x70
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Feb 15 08:08:47 npws01 kernel:
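For reference, the watchdog that emits this warning can be tuned at runtime through the sysctl path named in the log message itself. A minimal sketch (run as root; the 300-second value is an arbitrary example, not a recommendation):

```shell
# Check the current hung-task threshold (default 120 seconds):
cat /proc/sys/kernel/hung_task_timeout_secs

# Raise the threshold, e.g. to 300 seconds, if heavy I/O load makes
# 120-second stalls expected rather than pathological:
echo 300 > /proc/sys/kernel/hung_task_timeout_secs

# Or silence the message entirely, as the log itself suggests.
# Note this only hides the symptom; the tasks still stall:
echo 0 > /proc/sys/kernel/hung_task_timeout_secs
```

These settings do not persist across reboots; to make them permanent they would go in /etc/sysctl.conf (e.g. `kernel.hung_task_timeout_secs = 300`).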

Version-Release number of selected component (if applicable):
[root@npws01 log]# uname -a
Linux npws01.deos.udel.edu 2.6.18-238.1.1.el5 #1 SMP Tue Jan 4 13:32:19 EST 2011 x86_64 x86_64 x86_64 GNU/Linux


How reproducible:
Occasional during heavy load.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
System eventually becomes unresponsive.

Expected results:


Additional info:
Would like some idea of how to trace what is causing the problem, and then how to avoid it.

Comment 1 Prarit Bhargava 2011-02-25 13:24:19 UTC
Hi -- you're running Oracle?  Does the problem happen if Oracle is not running?

P.

Comment 2 Geoff Quelch 2011-02-25 23:40:50 UTC
Yes, this is an Oracle server. Unfortunately this is our production server so we can't do any tests. I doubt this problem would recur; we have already disabled the Oracle backups and have not had the problem since.

But, we can't go for long like that and would like to know how we can identify what processes are consuming resources such that we received the message above.

The message indicates the victim, how do we find the root cause?

Thanks.

Comment 3 Prarit Bhargava 2011-02-25 23:51:58 UTC
(In reply to comment #2)
> Yes, this is an Oracle server. Unfortunately this is our production server so
> we can't do any tests. I doubt this problem would recur; we have already
> disabled the Oracle backups and have not had the problem since.

Thanks for the info Geoff.

> 
> But, we can't go for long like that and would like to know how we can identify
> what processes are consuming resources such that we received the message above.


> 
> The message indicates the victim, how do we find the root cause?

There are a few ways to determine the cause by triggering a stack trace or panic when this issue occurs.  Both of those, unfortunately, require a modification of the kernel.

So this only happens when Oracle is loaded and is doing a backup?

P.

> 
> Thanks.

