Description of problem:
System appears frozen, yet responds to pings and TCP connects. No userspace processes run. sysrq-P reports idle in swapper; sysrq-T shows processes stuck in page_fault. Sync starts but does not finish. Remount starts but does not finish. sysrq-B mostly works.

Version-Release number of selected component (if applicable):
2.4.20-18.7bigmem

How reproducible:
Erratically. Not sure yet what combination of events triggers it, but it may be rapid operations on LARGE (2GB) files.

Actual results:
System freezes, no login or network access beyond the kernel level (ping, TCP connect). sysrq-B or power cycle needed to recover. Nothing is logged.

Expected results:
Graceful degradation, continued access. Log of problems.

Additional info:
Hardware:
Compaq DL380-G3
2 x Xeon 2.8GHz
4.5 GB RAM
260 MB swap

System is a database server (Informix and MySQL). Many large (2GB) database files are open, and many database processes with large memory and shmem allocations are running.

sysrq-P: (hand copied)
PID 0, comm: swapper
EIP 0010:[c0106e89] CPU:0
EIP is at default_idle [kernel] 0x29 (2.4.20-18.7bigmem)
EFLAGS: 00000246 Tainted: P
EAX: 00000000 EBX: c0106e60 ECX: 00000032 EBX: c0354000
ESI: c0354000 EDI: c0354000 EBP: c0106e60 DS: 0018 ES: 0018
CR0: 800500eb CR2: 40321c30 CR3: 2e660800 CR4 000006f0
Call Trace: [<c0106f22>] cpu_idle [kernel] 0x52 (0xc0355fcc)
[<c0105000>] stext [kernel] 0x0 (0xc0355fd8)

CR3 is the only value that changes on repeated sysrq-P runs.

sysrq-M: (hand copied)
(far too much stuff; the important parts, I think:)
Swap cache: add 76637, delete 35135, find 30010/33401, race 0+25
Free swap: 436 kB
1179648 pages of RAM
884730 pages of HIGHMEM
2770325 pages shared
41502 pages swap cached

sysrq-T: (hand copied)
(Lots scrolls by - how are we supposed to capture all of it?)
crond: D c030f980 3732 28948 1108 28941 (NOTLB)
__lock_page lock_page __pte_chain_free handle_mm_fault do_notify_parent zap_page_range zap_page_range do_page_fault exit_notify unmap_fixup do_sigaction sys_rt_sigaction filp_close sys_munmap do_page_fault error_code

It appears that the call tree is the same for at least the next-to-last listed process, the tail end of which was still visible.

Contention for swap maybe? I am adding more swap space.
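As a sanity check on the sysrq-M figures above, the page counts can be converted to sizes. This is only a minimal sketch: it assumes 4 KiB pages (the usual i386 page size; the report itself does not state it), and the helper name `pages_to_mib` is made up for illustration.

```python
# Assumption: 4 KiB pages, the usual i386 page size (not stated in the report).
PAGE_SIZE = 4096  # bytes


def pages_to_mib(pages: int, page_size: int = PAGE_SIZE) -> float:
    """Convert a page count into MiB."""
    return pages * page_size / (1024 * 1024)


# Figures hand-copied from the sysrq-M output in the report.
ram_pages = 1179648
highmem_pages = 884730
swap_cached_pages = 41502

# Whatever is not HIGHMEM is directly mapped low memory.
lowmem_pages = ram_pages - highmem_pages

if __name__ == "__main__":
    for label, pages in [
        ("RAM", ram_pages),
        ("HIGHMEM", highmem_pages),
        ("low memory", lowmem_pages),
        ("swap cached", swap_cached_pages),
    ]:
        print(f"{label:12s} {pages:8d} pages = {pages_to_mib(pages):8.1f} MiB")
```

Under that assumption, 1179648 pages works out to 4608 MiB, which matches the 4.5 GB of RAM listed under Hardware.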
> EFLAGS: 00000246 Tainted: P

Which kernel modules are in use?
Module                  Size  Used by    Tainted: P
sg                     34884   2  (autoclean)
st                     30416   0  (autoclean)
aic7xxx               155136   2
cpqasm                323552  22
cpqevt                  9440   2  [cpqasm]
nfsd                   78048   8  (autoclean)
lockd                  57600   1  (autoclean) [nfsd]
sunrpc                 83124   1  (autoclean) [nfsd lockd]
bcm5700               102020   2
loop                   11376   0  (autoclean)
usb-ohci               22016   0  (unused)
usbcore                77536   1  [usb-ohci]
ext3                   69536   4
jbd                    51720   4  [ext3]
lvm-mod                65376   5
cciss                  43360   5
sd_mod                 12828   0  (unused)
scsi_mod              112892   7  [sg st aic7xxx cciss sd_mod]
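For anyone comparing module lists across machines, a listing like the one above can be picked apart mechanically. The following is a rough sketch, not a supported tool: `parse_lsmod` and the `EXTERNAL` set are hypothetical names, the sample is an excerpt of the listing above, and the set holds only the modules that this thread itself names as externally supplied.

```python
import re

# Excerpt of the module listing from this report (2.4-style lsmod output).
SAMPLE = """\
Module                  Size  Used by    Tainted: P
sg                     34884   2  (autoclean)
aic7xxx               155136   2
cpqasm                323552  22
cpqevt                  9440   2  [cpqasm]
sunrpc                 83124   1  (autoclean) [nfsd lockd]
bcm5700               102020   2
cciss                  43360   5
scsi_mod              112892   7  [sg st aic7xxx cciss sd_mod]
"""

# Assumption: only the modules this thread names as externally supplied.
EXTERNAL = {"aic7xxx", "bcm5700", "cpqasm"}

# name, size, use count, then optional flags and a [referring modules] list.
LINE_RE = re.compile(r"^(?P<name>\S+)\s+(?P<size>\d+)\s+(?P<used>\d+)(?P<rest>.*)$")


def parse_lsmod(text):
    """Return one dict per module line; the header line is skipped."""
    modules = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # header or blank line: the size field is not numeric
        rest = match.group("rest")
        refs = re.search(r"\[([^\]]*)\]", rest)
        modules.append({
            "name": match.group("name"),
            "size": int(match.group("size")),
            "used": int(match.group("used")),
            "autoclean": "(autoclean)" in rest,
            "referenced_by": refs.group(1).split() if refs else [],
        })
    return modules


if __name__ == "__main__":
    for mod in parse_lsmod(SAMPLE):
        if mod["name"] in EXTERNAL:
            print(f"{mod['name']}: externally supplied, use count {mod['used']}")
```

The regular expression relies on the size and use-count columns being numeric, which is how the header line gets skipped without special handling.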
aic7xxx: aic79xx-linux-2.4-20030603-tar.gz.tgz
bcm5700: bcm5700-6.0.2d-1.src.rpm

Both are in use to avoid system lockups.
and cpqasm, which is known to cause lockups and reboots and is binary only...

*** This bug has been marked as a duplicate of 78616 ***
This is happening with an untainted kernel. The difference from the above sysrq-p data is minor...

sysrq-p:
CR2 was 40023000
CR3 was different, but it changes with each sysrq-p

sysrq-m:
Swap cache: add 0, delete 0, find 0/0, race 0+0
Free swap: 785392 kB (I added a bunch of swap; it apparently hasn't helped)
0 pages swap cached
...so swap wasn't even being touched.

sysrq-t:
The last entry was mysqld in block_sync_page.

One possibly related thing is that the system console had some diagnostic output from the AIC7xxx driver. The only devices plugged into that controller are a tape library and tape drive - the hard drives are on a RAID controller. I'm going to disable the tape software and unload the AIC7xxx module and see if it crashes again.
It did not freeze with the aic7xxx module unloaded, but I cannot perform the test again with it loaded until this weekend as this is a production server. It appears fairly consistently that the cause is many (write?) operations on large files. The last three lockups consistently happened during database bulk-load operations. Normal database use does not cause a lockup. The database data files are 2GB cooked files on lvm ext3 filesystems on a cciss (HP) RAID controller.
One of our developers did a database bulk load without checking with me, and it froze; I rebooted, removed the aic7xxx module, and told them to try again, and it completed successfully. I'm beginning to suspect the aic7xxx driver, even though the only thing plugged into it is a tape library and it does not directly have anything to do with access to the filesystems.
Okay, I have substituted an LSI Logic U160 SCSI controller for the Adaptec 29160 and am using the stock sym53c8xx drivers. This is still happening. It does not appear to be directly related to the Adaptec drivers. I'm now going to try the plain SMP kernel instead of the bigmem kernel. Is there any schedule for getting the 2.4.22 kernel for RH7.3?
I'm going to try kernel-bigmem-2.4.20-20.7.i686.rpm - the errata notice says it fixes "A rarely-seen race condition in updating page tables could cause systems to hang" and this really sounds like what is happening here. Could this be bug 100739 - RH9: Newburn hangs with PERC 4 + 2.4.20-19.9smp? Why am I getting "Access Denied" on http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=100739 ?
This is still happening in 2.4.20-20.7bigmem
You're getting "access denied" on bug 100739 because you do not have sufficient privileges to view that bug report.
Just for your information, if you encounter any problem with any Red Hat kernel which has had any 3rd party modules loaded into it after boot time, whether they are proprietary, GPL, other open source license or otherwise, your kernel and system are unsupported. Red Hat supports systems which are using the binary kernel and modules which we shipped with the OS, or the latest updates for that OS version. Any externally supplied modules are unsupported. If you are using any external modules at all, you will need to reproduce the problem without external modules ever being loaded since boot time. Also note that if you have loaded a 3rd party module, unloading the module is not sufficient; a full reboot is required to give a clean kernel slate, as the 3rd party module may have corrupted kernel memory, and we cannot support that.
> You're getting "access denied" on bug 100739 because
> you do not have sufficient privileges to view that bug report.

Forgive me. What I meant to ask is: why do bug reports have access permissions, and why is this bug being "censored"?
> If you are using any [non-RH] modules at all, you will need
> to reproduce the problem without [non-RH] modules ever being
> loaded since boot time.

This presents a problem, since the module (tg3) that RH ships for the Broadcom gigabit chip is itself unreliable (at least it was at 2.4.18) and locks the system up. I will try to purify the modules as soon as our crunch is over at the end of this week, and see what happens. Unfortunately I cannot reliably reproduce the hang on demand.
Okay, I obtained a firmware update from Compaq and applied it, and the system has been stable for about three weeks. I *think* we can close this. I don't know why it was freezing.
We too are having a similar problem, but with the RHAS 2.1/e.25 kernel. Red Hat insists on running tg3; Compaq only supports bcm5700. Red Hat says to unload proprietary modules (e.g., cciss), but we can't, since we'll never know if a disk blows without them. John, can you post the fw version(s) you upgraded to? Are you still having problems, or did the fw lick it? Thx, Rob Lojek
Yes, it has been stable since the system ROM was updated. We were having lockups with tg3; the bcm5700 module has been stable. I don't remember which version I installed; just grab the current one off the CHomPaq website - they may have updated it since November. "07/25/2003, Family 386P29, Type 03"? Eep! My redundant ROM is the older version... How do I flash the redundant ROM?
I haven't found an explicit way to force the redundant ROM to become up-to-date. From this doc, it looks like it's always a rev. behind: http://h18023.www1.hp.com/support/files/server/us/webdoc/rom/OnlineROMFlashUserGuide.pdf

Backups of the previous ROM image are made using one of the following methods:
• Redundant ROM - The ROM image acts as two separate ROMs. One section of the ROM contains the most current ROM version, while the other section of the ROM contains a previous version.
>> You're getting "access denied" on bug 100739 because
>> you do not have sufficient privileges to view that bug report.
> Forgive me. What I meant to ask is: why do bug reports have access
> permissions, and why is this bug being "censored"?

Was an answer ever given to this, btw? I, too, am interested in that particular bug.