89501 – Oracle 9i RAC machine becomes unresponsive

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	2.1
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Larry Woodman
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2003-04-23 16:05 UTC by Larry Troan
Modified:	2016-04-18 09:40 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2003-04-23 16:21:46 UTC
Target Upstream Version:
Embargoed:

Description Larry Troan 2003-04-23 16:05:15 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0rc1) Gecko/20020424

Description of problem:
THIS IS A BUGZILLA DOCUMENTING A PROBLEM THAT WAS FIXED IN THE RHEL AS2.1 QU2
BETA ERRATA WITH THE -e18 kernel. NO ACTION REQUIRED HERE.

Dell PE6650, 4 CPUs (no HT), 12 GB RAM, 20 GB swap. All updates applied,
kernel-2.4.9-e.12enterprise
(2) QLogic 2312 adapters, qla2300_060300 driver
   Connectrix
   Symmetrix
EMC PowerPath

The machine slows to a crawl and becomes unresponsive. It still answers ping
and TNS ping only (which means Oracle doesn't fail over), and the keyboard
lights still respond. No unusual messages appear in the logs.

Suggested setting up a serial console, enabling Magic SysRq, and running top
remotely, then capturing the SysRq output and other information and sending it
in for analysis.

Version-Release number of selected component (if applicable):
kernel-2.4.9-e.12enterprise

How reproducible:
Sometimes

Steps to Reproduce:
1. Run the machine for about two days under heavy load; it will lock up as
described above.

Additional info:

Comment 2 Larry Troan 2003-04-23 16:18:16 UTC

Had customer remove PowerPath binary module and switch from Emulex to QLogic
adapters so Red Hat would support them fully. Also had customer replace the GB
crossover cable between the two onboard Broadcom NIC adapters with a standard
CAT5E cable. Dell had them reconfigure their QLogic adapters so they didn't
share IRQs. Oracle had them apply latest TARs. We also verified they had the
latest ntp patch.

Got SysRq+t, +p, and +m traces, and Engineering found the following problem:

Event posted 03-28-2003 11:33am by lwoodman with duration of 0.00
OK, I think we see what the problem is here:
kswapd eventually calls invalidate_inode_pages() on one of the cpus
and that takes the pagemap_lru_lock before entering what can be a
very long loop of looking at inode pages.  This does a spin_trylock
of the page hash list lock for each page and, if that fails, lets go
of the pagemap_lru_lock and then re-enters the "very long loop".  In
this case one of the other cpus has the page hash list lock and is
spinning for the pagemap_lru_lock.  On certain hardware/bus
configurations the other cpu may never get the pagemap_lru_lock even
though it's spinning on it, because the lock is in the cache of the cpu
that owned it and wants it again.  This can cause the system to
deadlock.  The way to fix this is to limit the number of inode pages
invalidate_inode_pages() processes when it's called from kswapd.  Are
you willing to try out a new kernel that has only this change?  You are
the only customer we have seen with this problem so we have no other
way to verify this fix.

Larry Woodman, Dave Anderson and Rik van Riel...
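
The locking pattern described above, and the proposed bound on the scan, can
be sketched in user space as follows. This is only an illustration in Python,
not the kernel patch: the function name invalidate_inode_pages_sketch, the
max_scan parameter, and the two threading.Lock objects standing in for
pagemap_lru_lock and the page hash list lock are all assumptions made for the
example.

```python
import threading


def invalidate_inode_pages_sketch(pages, hash_lock, lru_lock, max_scan):
    """User-space analogy of a bounded inode-page scan (hypothetical names).

    lru_lock stands in for pagemap_lru_lock and hash_lock for the page
    hash list lock. max_scan is the assumed per-call budget: the scan
    gives up after examining that many pages, so it cannot keep
    re-taking lru_lock indefinitely while another CPU waits for it.
    Returns (invalidated_pages, pages_examined).
    """
    remaining = list(pages)
    invalidated = []
    scanned = 0

    while remaining and scanned < max_scan:
        lru_lock.acquire()
        try:
            for page in list(remaining):
                if scanned >= max_scan:
                    break  # budget used up: stop instead of looping on
                scanned += 1
                # Analogous to spin_trylock(): never block while holding
                # lru_lock.
                if not hash_lock.acquire(blocking=False):
                    # Someone else holds the hash lock. Leave the loop so
                    # lru_lock is dropped, then restart the scan.
                    break
                try:
                    remaining.remove(page)
                    invalidated.append(page)
                finally:
                    hash_lock.release()
        finally:
            lru_lock.release()

    return invalidated, scanned
```

With the budget in place, a call made while another thread holds the hash lock
returns after max_scan attempts rather than retrying for as long as the lock
stays held. Without the `scanned < max_scan` checks, the same call would keep
dropping and re-taking lru_lock, which is the behavior the comment above
identifies as the cause of the hang.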

Comment 3 Larry Troan 2003-04-23 16:21:26 UTC

Provided patch for customer to try. It ran successfully for over two weeks,
and the customer agreed to close the incident.
---
Fix is in the RHEL AS2.1 QU2 errata (-e18 kernel).

Reference Issue Tracker 17733. Bugzilla opened so Red Hat partners can track
this problem and its resolution.

Comment 4 Larry Troan 2003-04-23 16:26:28 UTC

Note that customer planned to reinstall PowerPath software once problem was
resolved. Also, there is a bug in the qla2300_060300 driver that is fixed in
the qla2300_0604 driver, which is also in the QU2 -e18 kernel. We tried a qla
fix prior to getting the SysRq+p trace in case this was what the customer was
hitting. Subsequently reverted to the qla2300_060300 driver when we found the
kernel problem.
