From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0rc1) Gecko/20020424

Description of problem:
THIS IS A BUGZILLA DOCUMENTING A PROBLEM THAT WAS FIXED IN THE RHEL AS2.1 QU2 BETA ERRATA WITH THE -e18 kernel. NO ACTION REQUIRED HERE.

Dell PE6650, 4 CPUs (no HT), 12 GB RAM, 20 GB swap.
All updates, kernel-2.4.9-e.12enterprise
(2) QLogic 2312 - qla2300_60300
Connectrix
Symmetrix
EMC PowerPath

Machine slows to a crawl and becomes unresponsive. It answers only ping and TNS ping (which means Oracle doesn't fail over), and the keyboard lights remain responsive. No unusual messages in logs.

Suggested a serial console, enabling Magic SysRq, and a remote top. Capture the SysRq output and other info and send it in for analysis.

Version-Release number of selected component (if applicable):
kernel-2.4.9-e.12enterprise

How reproducible:
Sometimes

Steps to Reproduce:
1. Have the machine run for about two days under heavy load; it will lock up as described above.

Additional info:
Had the customer remove the PowerPath binary module and switch from Emulex to QLogic adapters so Red Hat would support them fully. Also had the customer replace the GB crossover cable between the two onboard Broadcom NICs with a standard CAT5E cable. Dell had them reconfigure their QLogic adapters so they didn't share IRQs. Oracle had them apply the latest TARs. We also verified they had the latest ntp patch.

Got SysRq+t, +p, +m traces and Engineering found the following problem:

Event posted 03-28-2003 11:33am by lwoodman with duration of 0.00

OK, I think we see what the problem is here: kswapd eventually calls invalidate_inode_pages() on one of the cpus, and that takes the pagemap_lru_lock before entering what can be a very long loop of looking at inode pages. This does a spin_trylock of the page hash list lock for each page and, if that fails, lets go of the pagemap_lru_lock, then re-enters the "very long loop". In this case one of the other cpus has the page hash list lock and is spinning for the pagemap_lru_lock. On certain hardware/bus configurations the other cpu may never get the pagemap_lru_lock even though it's spinning on it, because it is in the cache of the cpu that owned it and wants it again. This can cause the system to deadlock.

The way to fix this is to limit the number of inode pages invalidate_inode_pages() processes when it's called from kswapd. Are you willing to try out a new kernel that has only this change? You are the only customer we have seen with this problem, so we have no other way to verify this fix.

Larry Woodman, Dave Anderson and Rik van Riel
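
For anyone tracking this who wants to see the shape of the problem, here is a rough userspace sketch of the locking pattern described in the analysis. It is not the Red Hat patch and not kernel code: the names toy_lock, toy_page, toy_invalidate_bounded and max_scan are made up for illustration, and the locks are plain C11 atomic flags standing in for pagemap_lru_lock and the page hash list lock. The only idea taken from the analysis is that the scan is given a budget, so a page whose hash lock stays busy cannot keep the scanner looping and re-taking the LRU lock indefinitely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel spinlock (illustration only). */
typedef struct { atomic_flag held; } toy_lock;

static void toy_lock_init(toy_lock *l) { atomic_flag_clear(&l->held); }

/* Returns 1 if the lock was acquired, 0 if it was already held. */
static int toy_trylock(toy_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void toy_lock_acquire(toy_lock *l) { while (!toy_trylock(l)) { } }

static void toy_unlock(toy_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct toy_page {
    toy_lock hash_lock; /* models the page hash list lock */
    int cached;         /* 1 while the page is still in the cache */
};

/*
 * Scan pages under the LRU lock. If a page's hash lock is busy, drop the
 * LRU lock, retake it, and restart the scan, as in the reported loop.
 * The assumed fix is the max_scan budget: the function gives up after
 * examining that many pages. A return value equal to max_scan means the
 * budget ran out and the caller should try again later.
 */
size_t toy_invalidate_bounded(toy_lock *lru, struct toy_page *pages,
                              size_t npages, size_t max_scan)
{
    size_t examined = 0;
    size_t i = 0;

    assert(pages != NULL || npages == 0);
    toy_lock_acquire(lru);
    while (i < npages && examined < max_scan) {
        struct toy_page *p = &pages[i];

        examined++;
        if (!p->cached) {       /* already invalidated on an earlier pass */
            i++;
            continue;
        }
        if (!toy_trylock(&p->hash_lock)) {
            toy_unlock(lru);    /* give the other holder a chance */
            toy_lock_acquire(lru);
            i = 0;              /* restart the scan from the beginning */
            continue;
        }
        p->cached = 0;          /* "invalidate" the page */
        toy_unlock(&p->hash_lock);
        i++;
    }
    toy_unlock(lru);
    return examined;
}
```

Without the max_scan check, a page whose hash lock never frees would keep this loop restarting forever while it repeatedly takes the LRU lock, which is the behaviour the traces pointed to. With the budget, the scanner stops and leaves both locks free for a while, so the other CPU has a window in which to make progress.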
Provided a patch for the customer to try. It ran successfully for over two weeks, and the customer agreed to close the incident.

---

Fix is in the RHEL AS2.1 QU2 errata (-e18 kernel). Reference Issue Tracker 17733. Bugzilla opened so Red Hat partners can track this problem and its resolution.
Note that the customer planned to reinstall the PowerPath software once the problem was resolved. Also, there is a bug in the qla2300_060300 driver that is fixed in the qla2300_0604 driver, which is also in the QU2 -e18 kernel. We tried a qla fix before getting the SysRq+p trace in case this was what the customer was hitting, then reverted to the qla2300_060300 driver when we found the kernel problem.