Red Hat Bugzilla – Bug 101615
2.4.20-18.7bigmem kernel freezes
Last modified: 2005-10-31 17:00:50 EST
Description of problem:
System appears frozen, yet responds to pings and TCP connects. No userspace
processes run. sysrq-P reports idle in swapper, sysrq-T shows processes stuck in
page_fault. Sync starts, does not finish. Remount starts but does not finish.
sysrq-B mostly works.
Version-Release number of selected component (if applicable):
How reproducible:
Erratically. Not sure yet what combination of events triggers it, but it may be
rapid operations on LARGE (2GB) files.
Actual results:
System freezes, no login or network access beyond the kernel level (ping, TCP
connect). sysrq-B or power cycle needed to recover. Nothing is logged.
Expected results:
Graceful degradation, continued access. Log of problems.
Additional info:
2 x Xeon 2.8GHz
4.5 GB RAM
260 MB swap
System is database server (Informix and MySQL).
Many large (2GB) database files are open, many database processes with large
memory and shmem allocations are running.
sysrq-P: (hand copied)
PID 0, comm: swapper
EIP 0010:[c0106e89] CPU:0
EIP is at default_idle [kernel] 0x29 (2.4.20-18.7bigmem)
EFLAGS: 00000246 Tainted: P
EAX: 00000000 EBX: c0106e60 ECX: 00000032 EDX: c0354000
ESI: c0354000 EDI: c0354000 EBP: c0106e60 DS: 0018 ES: 0018
CR0: 800500eb CR2: 40321c30 CR3: 2e660800 CR4 000006f0
Call Trace: [<c0106f22>] cpu_idle [kernel] 0x52 (0xc0355fcc)
[<c0105000>] stext [kernel] 0x0 (0xc0355fd8)
CR3 is the only value that changes on repetitive sysrq-P runs.
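For anyone reading the register dump above, the EFLAGS value can be unpacked with a few lines of Python. This is only a minimal sketch: the bit positions are the standard x86 ones, while the table name EFLAGS_BITS and the helper decode_eflags are made up for illustration.

```python
# Standard x86 EFLAGS bit positions (status and control flags only).
# Bit 1 is reserved and always reads as 1, so it is deliberately left out.
EFLAGS_BITS = {
    0: "CF",   # carry
    2: "PF",   # parity
    4: "AF",   # auxiliary carry
    6: "ZF",   # zero
    7: "SF",   # sign
    8: "TF",   # trap
    9: "IF",   # interrupt enable
    10: "DF",  # direction
    11: "OF",  # overflow
}


def decode_eflags(value):
    """Return the names of the flags set in an EFLAGS value, lowest bit first."""
    return [name for bit, name in sorted(EFLAGS_BITS.items()) if value & (1 << bit)]


# Value hand-copied from the sysrq-P output in this report.
print(decode_eflags(0x00000246))  # ['PF', 'ZF', 'IF']
```

IF being set means interrupts were enabled on CPU 0 while it sat in default_idle, which would be consistent with the machine still answering pings.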
sysrq-M: (hand copied)
(far too much stuff; the important parts I think:)
Swap cache: add 76637, delete 35135, find 30010/33401, race 0+25
Free swap: 436 kB
1179648 pages of RAM
884730 pages of HIGHMEM
2770325 pages shared
41502 pages swap cached
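To put the page counts above into more familiar units, here is a rough conversion sketch. It assumes the usual 4096-byte i386 page size (not stated in the sysrq-M output), and the helper name pages_to_mib is hypothetical.

```python
PAGE_SIZE = 4096  # bytes; assumed i386 page size, not taken from the sysrq-M output


def pages_to_mib(pages, page_size=PAGE_SIZE):
    """Convert a page count to mebibytes."""
    return pages * page_size / (1024 * 1024)


# Page counts hand-copied from the sysrq-M output in this report.
counts = [
    ("RAM", 1179648),
    ("HIGHMEM", 884730),
    ("swap cached", 41502),
]

for label, pages in counts:
    print(f"{label}: {pages_to_mib(pages):.1f} MiB")
```

Under that assumption, 1179648 pages is 4608 MiB, which matches the 4.5 GB of RAM listed for this machine.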
sysrq-T: (hand copied)
(Lots scrolls by - how are we supposed to capture all of it?)
crond: D c030f980 3732 28948 1108 28941 (NOTLB)
It appears that the call tree is the same for at least the next-to-last listed
process, the tail end of which was still visible.
Contention for swap maybe?
I am adding more swap space.
> EFLAGS: 00000246 Tainted: P
which kernel modules are in use ?
Module Size Used by Tainted: P
sg 34884 2 (autoclean)
st 30416 0 (autoclean)
aic7xxx 155136 2
cpqasm 323552 22
cpqevt 9440 2 [cpqasm]
nfsd 78048 8 (autoclean)
lockd 57600 1 (autoclean) [nfsd]
sunrpc 83124 1 (autoclean) [nfsd lockd]
bcm5700 102020 2
loop 11376 0 (autoclean)
usb-ohci 22016 0 (unused)
usbcore 77536 1 [usb-ohci]
ext3 69536 4
jbd 51720 4 [ext3]
lvm-mod 65376 5
cciss 43360 5
sd_mod 12828 0 (unused)
scsi_mod 112892 7 [sg st aic7xxx cciss sd_mod]
Both in use to avoid system lockups.
and cpqasm which is known to cause lockups and reboots and is binary only...
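Since the discussion turns on which loaded modules are vendor-supplied, a small script can pick them out of 2.4-style lsmod output. This is a sketch only: the THIRD_PARTY set is an assumption based on the modules this thread describes as coming from Compaq rather than Red Hat, the sample text is a subset of the listing above, and the function names are hypothetical.

```python
import re

# A subset of the lsmod listing from this report (2.4-style format).
LSMOD_TEXT = """\
Module Size Used by Tainted: P
sg 34884 2 (autoclean)
cpqasm 323552 22
cpqevt 9440 2 [cpqasm]
bcm5700 102020 2
scsi_mod 112892 7 [sg st aic7xxx cciss sd_mod]
"""

# Assumption: modules treated here as vendor-supplied rather than shipped by Red Hat.
THIRD_PARTY = {"cpqasm", "cpqevt", "bcm5700"}


def parse_lsmod(text):
    """Parse lsmod lines into dicts; the header line is skipped because it has no numeric size."""
    modules = []
    for line in text.splitlines():
        m = re.match(r"(\S+)\s+(\d+)\s+(\d+)(.*)", line)
        if not m:
            continue
        name, size, used, rest = m.groups()
        users = re.search(r"\[([^\]]*)\]", rest)
        modules.append({
            "name": name,
            "size": int(size),
            "used": int(used),
            # Modules listed in brackets are the ones referring to this module.
            "users": users.group(1).split() if users else [],
        })
    return modules


def third_party_loaded(modules, third_party=THIRD_PARTY):
    """Return names of loaded modules that appear in the third-party set, in listing order."""
    return [m["name"] for m in modules if m["name"] in third_party]


print(third_party_loaded(parse_lsmod(LSMOD_TEXT)))
```

Keeping the set as a parameter makes it easy to adjust once it is clear which modules the distribution actually ships.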
*** This bug has been marked as a duplicate of 78616 ***
This is happening with an untainted kernel.
The difference from the above sysrq-p data is minor...
CR2 was 40023000
CR3 was different but it changes with each sysrq-p
Swap cache: add 0, delete 0, find 0/0, race 0+0
Free swap: 785392 kB (I added a bunch of swap, it apparently hasn't helped)
0 pages swap cached
...so swap wasn't even being touched.
the last entry was mysqld in block_sync_page.
One possibly related thing is the system console had some diagnostic output from
the AIC7xxx driver. The only devices plugged into that controller are a tape
library and tape drive - the hard drives are on a RAID controller.
I'm going to disable the tape software and unload the AIC7xxx module and see if
it crashes again.
It did not freeze with the aic7xxx module unloaded, but I cannot perform the
test again with it loaded until this weekend as this is a production server.
It appears fairly consistently that the cause is many (write?) operations on
large files. The last three lockups consistently happened during database
bulk-load operations. Normal database use does not cause a lockup.
The database data files are 2GB cooked files on lvm ext3 filesystems on a cciss
(HP) RAID controller.
One of our developers did a database bulk load without checking with me, and it
froze; I rebooted, removed the aic7xxx module, and told them to try again, and
it completed successfully.
I'm beginning to suspect the aic7xxx driver, even though the only thing plugged
into it is a tape library and it does not directly have anything to do with
access to the filesystems.
Okay, I have substituted an LSI Logic U160 SCSI controller for the Adaptec 29160
and am using the stock sym53c8xx drivers. This is still happening. It does not
appear to be directly related to the Adaptec drivers.
I'm now going to try the plain SMP kernel instead of the bigmem kernel.
Is there any schedule for getting the 2.4.22 kernel for RH7.3?
I'm going to try kernel-bigmem-2.4.20-20.7.i686.rpm - the errata notice says it
fixes "A rarely-seen race condition in updating page tables could cause systems
to hang" and this really sounds like what is happening here.
Could this be bug 100739 - RH9: Newburn hangs with PERC 4 + 2.4.20-19.9smp?
Why am I getting "Access Denied" on
This is still happening in 2.4.20-20.7bigmem
You're getting "access denied" on bug 100739 because you do not have
sufficient privileges to view that bug report.
Just for your information, if you encounter any problem with any Red Hat
kernel which has had any 3rd party modules loaded into it after boot time,
whether they are proprietary, GPL, other open source license or otherwise,
your kernel and system are unsupported. Red Hat supports systems which
are using the binary kernel and modules which we shipped with the OS, or
the latest updates for that OS version. Any externally supplied modules
If you are using any external modules at all, you will need to reproduce the
problem without external modules ever being loaded since boot time. Also
note that if you have loaded a 3rd party module, unloading the module
is not sufficient; a full reboot is required to give a clean kernel slate,
as the 3rd party module may have corrupted kernel memory, and we can not
> You're getting "access denied" on bug 100739 because
> you do not have sufficient privileges to view that bug report.
Forgive me. What I meant to ask is: why do bug reports have access permissions,
and why is this bug being "censored"?
> If you are using any [non-RH] modules at all, you will need
> to reproduce the problem without [non-RH] modules ever being
> loaded since boot time.
This presents a problem, since the module (tg3) that RH ships for the Broadcom
gigabit chip is itself unreliable (at least it was at 2.4.18) and locks the
I will try to purify the modules as soon as our crunch is over at the end of
this week, and see what happens. Unfortunately I cannot reliably reproduce the
hang on demand.
Okay, I obtained a firmware update from Compaq and applied it, and the
system has been stable for about three weeks.
I *think* we can close this. I don't know why it was freezing.
We too are having a similar problem, but with RHAS 2.1/e.25 kernel.
Red Hat insists on running tg3; Compaq only supports bcm5700. Red Hat
says to unload proprietary modules (e.g., cciss), but we can't, since
we'll never know if a disk blows without them.
John, can you post the fw version(s) you upgraded to? Are you still
having problems, or did the fw lick it?
Yes, it has been stable since the system ROM was updated.
We were having lockups with tg3, the bcm5700 module has been stable.
I don't remember which version I installed; just grab the current one
off the CHomPaq website - they may have updated it since November.
"07/25/2003, Family 386P29, Type 03"?
Eep! My redundant ROM is the older version... How do I flash the
I haven't found an explicit way to force the redundant ROM to become
up-to-date. From this doc, it looks like it's always a rev. behind:
Backups of the previous ROM image are made using one of the following
• Redundant ROM: The ROM image acts as two separate ROMs. One section of the
ROM contains the most current ROM version, while the other section of the ROM
contains a previous version.
>> You're getting "access denied" on bug 100739 because
>> you do not have sufficient privileges to view that bug report.
> Forgive me. What I meant to ask is: why do bug reports have access
> permissions, and why is this bug being "censored"?
Was an answer ever given to this, btw.? I, too, am interested in that