Bug 101615 - 2.4.20-18.7bigmem kernel freezes
Status: CLOSED WONTFIX
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i686 Linux
Priority: high
Severity: high
Assigned To: Arjan van de Ven
Reported: 2003-08-04 15:05 EDT by John Hardin
Modified: 2005-10-31 17:00 EST (History)
Last Closed: 2003-11-17 13:43:26 EST
Description John Hardin 2003-08-04 15:05:35 EDT
Description of problem:
System appears frozen, yet responds to pings and TCP connects. No userspace
processes run. sysrq-P reports idle in swapper, sysrq-T shows processes stuck in
page_fault. Sync starts, does not finish. Remount starts but does not finish.
sysrq-B mostly works.

Version-Release number of selected component (if applicable):
2.4.20-18.7bigmem

How reproducible:
Erratically. Not sure yet what combination of events triggers it, but it may be
rapid operations on LARGE (2GB) files.

Actual results:
System freezes, no login or network access beyond the kernel level (ping, TCP
connect). sysrq-B or power cycle needed to recover. Nothing is logged.

Expected results:
Graceful degradation, continued access. Log of problems.

Additional info:
Hardware:
  Compaq DL380-G3
    2 x Xeon 2.8GHz
    4.5 GB RAM
    260 MB swap

System is database server (Informix and MySQL). 
Many large (2GB) database files are open, many database processes with large
memory and shmem allocations are running.

sysrq-P: (hand copied)
PID 0, comm:   swapper
EIP 0010:[c0106e89]  CPU:0
EIP is at default_idle [kernel] 0x29 (2.4.20-18.7bigmem)
EFLAGS: 00000246   Tainted: P
EAX: 00000000   EBX: c0106e60   ECX: 00000032   EDX: c0354000
ESI: c0354000   EDI: c0354000   EBP: c0106e60   DS:  0018  ES: 0018
CR0: 800500eb   CR2: 40321c30   CR3: 2e660800   CR4: 000006f0
Call Trace: [<c0106f22>] cpu_idle [kernel] 0x52 (0xc0355fcc)
            [<c0105000>] stext [kernel] 0x0 (0xc0355fd8)

CR3 is the only value that changes on repetitive sysrq-P runs.

sysrq-M: (hand copied)
(far too much stuff; the important parts I think:)
Swap cache:  add 76637, delete 35135, find 30010/33401, race 0+25
Free swap:   436 kB
1179648 pages of RAM
884730 pages of HIGHMEM
2770325 pages shared
41502 pages swap cached
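
As a rough cross-check of the sysrq-M figures, the page counts can be converted to sizes. This is a minimal sketch that assumes 4 KiB pages (the standard page size on i686); the helper name `pages_to_mib` is made up for illustration.

```python
# Assumption: 4 KiB pages, the standard page size on i686.
PAGE_SIZE_KIB = 4


def pages_to_mib(pages: int) -> float:
    """Convert a page count to MiB."""
    return pages * PAGE_SIZE_KIB / 1024


total_pages = 1179648    # "pages of RAM" from the sysrq-M dump
highmem_pages = 884730   # "pages of HIGHMEM" from the sysrq-M dump
lowmem_pages = total_pages - highmem_pages

for label, pages in [
    ("total", total_pages),
    ("highmem", highmem_pages),
    ("lowmem", lowmem_pages),
]:
    print(f"{label:8s} {pages:8d} pages = {pages_to_mib(pages):8.1f} MiB")
```

Under that assumption the total works out to 4608 MiB (4.5 GiB), which matches the installed RAM, and roughly 1152 MiB of it is low memory.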

sysrq-T: (hand copied)
(Lots scrolls by - how are we supposed to capture all of it?)
crond:   D  c030f980  3732  28948  1108    28941 (NOTLB)
  __lock_page
  lock_page
  __pte_chain_free
  handle_mm_fault
  do_notify_parent
  zap_page_range
  zap_page_range
  do_page_fault
  exit_notify
  unmap_fixup
  do_sigaction
  sys_rt_sigaction
  filp_close
  sys_munmap
  do_page_fault
  error_code

It appears that the call tree is the same for at least the next-to-last listed
process, the tail end of which was still visible.
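
If a full sysrq-T dump can be saved as text, the tasks stuck in uninterruptible sleep (state D) can be picked out mechanically. The following is only a sketch: it assumes every task header looks like the hand-copied crond line above (name, one-letter state, eight-digit hex address) and that stack frames are indented below it. The names `TASK_RE` and `parse_tasks` are hypothetical, and the sample is a shortened excerpt of the trace above.

```python
import re

# Assumption: a task header looks like "name:  STATE  hex-address ...",
# as in the hand-copied crond line; other kernels may format this differently.
TASK_RE = re.compile(
    r"^(?P<comm>\S+?):?\s+(?P<state>[A-Z])\s+(?P<pc>[0-9a-fA-F]{8})\b"
)


def parse_tasks(text):
    """Return a list of {'comm', 'state', 'frames'} dicts from a sysrq-T dump."""
    tasks = []
    for line in text.splitlines():
        m = TASK_RE.match(line)
        if m:
            tasks.append(
                {"comm": m.group("comm"), "state": m.group("state"), "frames": []}
            )
        elif line.strip() and tasks and line[0].isspace():
            # Indented lines are treated as stack frames of the last task seen.
            tasks[-1]["frames"].append(line.strip())
    return tasks


# Shortened excerpt of the crond entry above.
SAMPLE = """\
crond:   D  c030f980  3732  28948  1108    28941 (NOTLB)
  __lock_page
  lock_page
  __pte_chain_free
"""

blocked = [t for t in parse_tasks(SAMPLE) if t["state"] == "D"]
for task in blocked:
    print(task["comm"], "blocked in", task["frames"][0])
```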

Contention for swap maybe?

I am adding more swap space.
Comment 1 Arjan van de Ven 2003-08-04 15:12:31 EDT
EFLAGS: 00000246   Tainted: P

Which kernel modules are in use?
Comment 2 John Hardin 2003-08-04 16:29:26 EDT
Module                  Size  Used by    Tainted: P
sg                     34884   2  (autoclean)
st                     30416   0  (autoclean)
aic7xxx               155136   2
cpqasm                323552  22
cpqevt                  9440   2  [cpqasm]
nfsd                   78048   8  (autoclean)
lockd                  57600   1  (autoclean) [nfsd]
sunrpc                 83124   1  (autoclean) [nfsd lockd]
bcm5700               102020   2
loop                   11376   0  (autoclean)
usb-ohci               22016   0  (unused)
usbcore                77536   1  [usb-ohci]
ext3                   69536   4
jbd                    51720   4  [ext3]
lvm-mod                65376   5
cciss                  43360   5
sd_mod                 12828   0  (unused)
scsi_mod              112892   7  [sg st aic7xxx cciss sd_mod]
Comment 3 John Hardin 2003-08-04 16:30:56 EDT
aic7xxx:  aic79xx-linux-2.4-20030603-tar.gz.tgz

bcm5700:  bcm5700-6.0.2d-1.src.rpm

Both in use to avoid system lockups.
Comment 4 Arjan van de Ven 2003-08-05 04:09:27 EDT
and cpqasm which is known to cause lockups and reboots and is binary only...


*** This bug has been marked as a duplicate of 78616 ***
Comment 5 John Hardin 2003-08-26 13:00:55 EDT
This is happening with an untainted kernel.

The difference from the above sysrq-p data is minor...

sysrq-p:
CR2 was 40023000
CR3 was different but it changes with each sysrq-p

sysrq-m:
Swap cache:  add 0, delete 0, find 0/0, race 0+0
Free swap:   785392 kB   (I added a bunch of swap, it apparently hasn't helped)
0 pages swap cached
...so swap wasn't even being touched.

sysrq-t:
the last entry was mysqld in block_sync_page.

One possibly related thing is the system console had some diagnostic output from
the AIC7xxx driver. The only devices plugged into that controller are a tape
library and tape drive - the hard drives are on a RAID controller.

I'm going to disable the tape software and unload the AIC7xxx module and see if
it crashes again.

Comment 6 John Hardin 2003-08-27 14:57:17 EDT
It did not freeze with the aic7xxx module unloaded, but I cannot perform the
test again with it loaded until this weekend as this is a production server.

It appears fairly consistently that the cause is many (write?) operations on
large files. The last three lockups consistently happened during database
bulk-load operations. Normal database use does not cause a lockup.

The database data files are 2GB cooked files on lvm ext3 filesystems on a cciss
(HP) RAID controller.
Comment 7 John Hardin 2003-09-01 13:12:47 EDT
One of our developers did a database bulk load without checking with me, and it
froze; I rebooted, removed the aic7xxx module, and told them to try again, and
it completed successfully.

I'm beginning to suspect the aic7xxx driver, even though the only thing plugged
into it is a tape library and it does not directly have anything to do with
access to the filesystems.
Comment 8 John Hardin 2003-09-21 08:58:45 EDT
Okay, I have substituted an LSI Logic U160 SCSI controller for the Adaptec 29160
and am using the stock sym53c8xx drivers. This is still happening. It does not
appear to be directly related to the Adaptec drivers.

I'm now going to try the plain SMP kernel instead of the bigmem kernel.

Is there any schedule for getting the 2.4.22 kernel for RH7.3?
Comment 9 John Hardin 2003-09-21 11:42:51 EDT
I'm going to try kernel-bigmem-2.4.20-20.7.i686.rpm - the errata notice says it
fixes "A rarely-seen race condition in updating page tables could cause systems
to hang" and this really sounds like what is happening here.

Could this be bug 100739 - RH9: Newburn hangs with PERC 4 + 2.4.20-19.9smp?

Why am I getting "Access Denied" on
http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=100739
?
Comment 10 John Hardin 2003-09-26 11:05:13 EDT
This is still happening in 2.4.20-20.7bigmem
Comment 11 Mike A. Harris 2003-09-29 16:45:07 EDT
You're getting "access denied" on bug 100739 because you do not have
sufficient privileges to view that bug report.

Comment 12 Mike A. Harris 2003-09-29 16:48:24 EDT
Just for your information, if you encounter any problem with any Red Hat
kernel which has had any 3rd party modules loaded into it after boot time,
whether they are proprietary, GPL, other open source license or otherwise,
your kernel and system are unsupported.  Red Hat supports systems which
are using the binary kernel and modules which we shipped with the OS, or
the latest updates for that OS version.  Any externally supplied modules
are unsupported.

If you are using any external modules at all, you will need to reproduce the
problem without external modules ever being loaded since boot time.  Also
note that if you have loaded a 3rd party module, unloading the module
is not sufficient; a full reboot is required to give a clean kernel slate,
as the 3rd party module may have corrupted kernel memory, and we cannot
support that.
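
A saved lsmod listing can be checked against the modules this report identifies as externally supplied. This is a sketch only: the `THIRD_PARTY` set is assembled from comments 3 and 4 and is not an official list, including cpqevt is an assumption, and the function name `loaded_modules` is hypothetical. It also only sees modules that are currently loaded, so it cannot show that a module was never loaded since boot.

```python
# Modules this report identifies as not shipped with the Red Hat kernel:
# the replaced aic7xxx and bcm5700 drivers (comment 3) and the binary-only
# cpqasm (comment 4). cpqevt is an assumption: lsmod lists it as used by
# cpqasm, so it presumably comes from the same package.
THIRD_PARTY = {"aic7xxx", "bcm5700", "cpqasm", "cpqevt"}


def loaded_modules(lsmod_text):
    """Return module names from lsmod output, skipping the header line."""
    names = []
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Data rows have a numeric size in the second column; the header does not.
        if len(fields) >= 2 and fields[1].isdigit():
            names.append(fields[0])
    return names


# Excerpt of the lsmod output from comment 2.
SAMPLE = """\
Module                  Size  Used by    Tainted: P
sg                     34884   2  (autoclean)
aic7xxx               155136   2
cpqasm                323552  22
cpqevt                  9440   2  [cpqasm]
bcm5700               102020   2
ext3                   69536   4
"""

flagged = [name for name in loaded_modules(SAMPLE) if name in THIRD_PARTY]
print("externally supplied modules loaded:", ", ".join(flagged) or "none")
```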
Comment 13 John Hardin 2003-09-29 17:12:49 EDT
> You're getting "access denied" on bug 100739 because
> you do not have sufficient privileges to view that bug report.

Forgive me. What I meant to ask is: why do bug reports have access permissions,
and why is this bug being "censored"?
Comment 14 John Hardin 2003-09-29 17:18:57 EDT
> If you are using any [non-RH] modules at all, you will need
> to reproduce the problem without [non-RH] modules ever being
> loaded since boot time. 

This presents a problem, since the module (tg3) that RH ships for the Broadcom
gigabit chip is itself unreliable (at least it was at 2.4.18) and locks the
system up.

I will try to purify the modules as soon as our crunch is over at the end of
this week, and see what happens. Unfortunately I cannot reliably reproduce the
hang on demand.

Comment 15 John Hardin 2003-11-17 13:43:26 EST
Okay, I obtained a firmware update from Compaq and applied it, and the
system has been stable for about three weeks.

I *think* we can close this. I don't know why it was freezing.
Comment 16 rob lojek 2004-01-16 14:43:05 EST
We too are having a similar problem, but with the RHAS 2.1/e.25 kernel.
Red Hat insists on running tg3; Compaq only supports bcm5700. Red Hat
says to unload proprietary modules (e.g., cciss), but we can't, since
we'll never know if a disk blows without them.

John, can you post the fw version(s) you upgraded to? Are you still
having problems, or did the fw lick it?

Thx,

Rob Lojek
Comment 17 John Hardin 2004-01-16 16:17:53 EST
Yes, it has been stable since the system ROM was updated.

We were having lockups with tg3; the bcm5700 module has been stable.

I don't remember which version I installed; just grab the current one
off the CHomPaq website - they may have updated it since November.

"07/25/2003, Family 386P29, Type 03"?

Eep! My redundant ROM is the older version... How do I flash the
redundant ROM?
Comment 18 rob lojek 2004-01-16 16:28:51 EST
I haven't found an explicit way to force the redundant ROM to become
up-to-date. From this doc, it looks like it's always a rev. behind:


http://h18023.www1.hp.com/support/files/server/us/webdoc/rom/OnlineROMFlashUserGuide.pdf


Backups of the previous ROM image are made using one of the following
methods:
• Redundant ROM: The ROM image acts as two separate ROMs. One section of the
ROM contains the most current ROM version, while the other section of the ROM
contains a previous version.

Comment 19 rob lojek 2004-01-16 16:32:36 EST
>> You're getting "access denied" on bug 100739 because
>> you do not have sufficient privileges to view that bug report.

> Forgive me. What I meant to ask is: why do bug reports have access 
> permissions, and why is this bug being "censored"?

Was an answer ever given to this, btw? I, too, am interested in that
particular bug.
