Bug 678821

Summary: processes being blocked by kernel; machine eventually becomes unresponsive
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.6
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: unspecified
Target Milestone: rc
Target Release: ---
Reporter: Geoff Quelch <gequelch>
Assignee: Prarit Bhargava <prarit>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: jarod
Doc Type: Bug Fix
Last Closed: 2011-03-24 14:16:31 UTC

Description Geoff Quelch 2011-02-19 22:54:30 UTC
Description of problem: Receiving this in the log file:
Feb 15 08:08:47 npws01 kernel: INFO: task oracle:2517 blocked for more than 120 seconds.
Feb 15 08:08:47 npws01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 15 08:08:47 npws01 kernel: oracle        D ffff81000900caa0     0  2517   2508                2516 (NOTLB)
Feb 15 08:08:47 npws01 kernel:  ffff810135b5da58 0000000000000082 0000000000000000 ffff810438833d78
Feb 15 08:08:47 npws01 kernel:  ffff810438833c00 0000000000000009 ffff81043fff5860 ffff81010ef0e100
Feb 15 08:08:47 npws01 kernel:  00003b404f3288a6 0000000000044583 ffff81043fff5a48 0000000138833c00
Feb 15 08:08:47 npws01 kernel: Call Trace:
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90
Feb 15 08:08:47 npws01 kernel:  [<ffffffff80028b0b>] sync_page+0x0/0x43
Feb 15 08:08:47 npws01 kernel:  [<ffffffff800637ca>] io_schedule+0x3f/0x67
Feb 15 08:08:47 npws01 kernel:  [<ffffffff80028b49>] sync_page+0x3e/0x43
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8006390e>] __wait_on_bit_lock+0x36/0x66
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8003fdc1>] __lock_page+0x5e/0x64
Feb 15 08:08:47 npws01 kernel:  [<ffffffff800a28e2>] wake_bit_function+0x0/0x23
Feb 15 08:08:47 npws01 kernel:  [<ffffffff80013b22>] find_lock_page+0x69/0xa2
Feb 15 08:08:47 npws01 kernel:  [<ffffffff800c805f>] grab_cache_page_write_begin+0x2c/0x89
Feb 15 08:08:47 npws01 kernel:  [<ffffffff88665125>] :nfs:nfs_write_begin+0x41/0xf8
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8000fda9>] generic_file_buffered_write+0x14b/0x675
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8003fdc1>] __lock_page+0x5e/0x64
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8001669b>] __generic_file_aio_write_nolock+0x369/0x3b6
Feb 15 08:08:47 npws01 kernel:  [<ffffffff80021872>] generic_file_aio_write+0x65/0xc1
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8866584d>] :nfs:nfs_file_write+0xd8/0x14f
Feb 15 08:08:47 npws01 kernel:  [<ffffffff80018301>] do_sync_write+0xc7/0x104
Feb 15 08:08:47 npws01 kernel:  [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
Feb 15 08:08:47 npws01 kernel:  [<ffffffff80016aa3>] vfs_write+0xce/0x174
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8004400e>] sys_pwrite64+0x50/0x70
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Feb 15 08:08:47 npws01 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Feb 15 08:08:47 npws01 kernel:

Version-Release number of selected component (if applicable):
[root@npws01 log]# uname -a
Linux npws01.deos.udel.edu 2.6.18-238.1.1.el5 #1 SMP Tue Jan 4 13:32:19 EST 2011 x86_64 x86_64 x86_64 GNU/Linux


How reproducible:
Occasional during heavy load.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
System eventually becomes unresponsive.

Expected results:


Additional info:
Would like some idea of how to trace what is causing the problem, and then how to avoid it.
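
For what it's worth, a starting point for tracing this is to catch the tasks sitting in uninterruptible sleep (state D) while a stall is in progress, since that is the state the hung-task watchdog reports on. A minimal sketch, assuming standard procps tools:

# List D-state (uninterruptible) tasks and the kernel function each one
# is blocked in; run while the stall is happening.
ps -eo state,pid,ppid,wchan:32,cmd | awk '$1 == "D"'

The wchan column can then be matched against symbols in the call trace above (e.g. __lock_page, io_schedule).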

Comment 1 Prarit Bhargava 2011-02-25 13:24:19 UTC
Hi -- you're running Oracle?  Does the problem happen if Oracle is not running?

P.

Comment 2 Geoff Quelch 2011-02-25 23:40:50 UTC
Yes, this is an Oracle server. Unfortunately this is our production server so we can't do any tests. I doubt this problem would recur; we have already disabled the Oracle backups and have not had the problem since.

But we can't go on like that for long, and we would like to know how we can identify which processes are consuming resources such that we received the message above.

The message indicates the victim, how do we find the root cause?

Thanks.
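
The call trace in the description is in the NFS client write path (nfs_write_begin, nfs_file_write), so the usual suspect is a slow or overloaded NFS server or network rather than the local box. A rough sketch of how to check that from the client side, assuming nfs-utils is installed:

# Client-side RPC statistics; a high retrans count relative to calls
# suggests the NFS server or the network is the bottleneck.
nfsstat -c -o rpc

Comparing these counters before and after a stall, together with the D-state listing above, helps narrow down whether the root cause is local I/O pressure or the NFS server.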

Comment 3 Prarit Bhargava 2011-02-25 23:51:58 UTC
(In reply to comment #2)
> Yes, this is an Oracle server. Unfortunately this is our production server so
> we can't do any tests. I doubt this problem would recur; we have already
> disabled the Oracle backups and have not had the problem since.

Thanks for the info Geoff.

> 
> But, we can't go for long like that and would like to know how we can identify
> what processes are consuming resources such that we received the message above.


> 
> The message indicates the victim, how do we find the root cause?

There are a few ways to determine the cause, such as triggering a stack trace or a panic when this issue occurs.  Both of those, unfortunately, require a modification of the kernel.
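
A rough sketch of the related sysctl knobs, with the caveat that hung_task_panic may not be present on this kernel (check for the file first) and SysRq must be enabled:

# If this kernel carries the hung_task_panic sysctl, turn the hung-task
# warning into a panic so a vmcore can be captured with kdump.
[ -f /proc/sys/kernel/hung_task_panic ] && echo 1 > /proc/sys/kernel/hung_task_panic

# Dump the stacks of all tasks to the kernel log for a system-wide picture
# at the moment of the stall.
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger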

So this only happens when Oracle is loaded and is doing a backup?

P.

> 
> Thanks.