Bug 139113 - System hangs for 15-45 seconds on RHEL3 / kernel 2.4.21-20.EL
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686 Linux
Priority: medium   Severity: high
Assigned To: Larry Woodman
Depends On:
Blocks: 156320
Reported: 2004-11-12 18:13 EST by John Caruso
Modified: 2007-11-30 17:07 EST
CC: 6 users

Fixed In Version: RHSA-2005-663
Doc Type: Bug Fix
Last Closed: 2005-09-28 10:33:50 EDT


Attachments
12:28 sar -q / AltSysrq output (30.40 KB, text/plain) - 2004-11-12 18:20 EST, John Caruso
14:22 sar -q / AltSysrq output (77.21 KB, text/plain) - 2004-11-12 18:23 EST, John Caruso
15:05 sar -q / AltSysrq output (77.21 KB, text/plain) - 2004-11-12 18:26 EST, John Caruso
netdump output from a system crash (7/2/2005, 10:23am) (9.09 KB, text/plain) - 2005-07-08 12:30 EDT, John Caruso

Description John Caruso 2004-11-12 18:13:32 EST
Description of problem:
The kernel appears to hang for a short period (15-45 seconds), 
rendering the system completely unresponsive during that time.

Version-Release number of selected component (if applicable):
kernel-hugemem-2.4.21-20.EL

This occurs on a system running Oracle 9i and Veritas VCS (and 
nothing else).  The system is a production database server, and so 
it's always busy to some extent, but we can't tie these events to any 
specific Oracle operations.  We see these hangs fairly frequently--
maybe 4-5 times per day--and they don't appear to correlate at all to 
the level of database activity (we've had them happen late at night 
when the database is only very lightly used).

Additional info:
The "signature" of this event is that the load average shoots up to 
about 80.  However, this happens *after* the hang, based on our 
debugging thus far, and appears to be a result of processes backing 
up while the kernel is hung.  The other symptom is that the system 
becomes completely unresponsive (during the hang, but before the load 
spike).

I've set up monitoring to collect AltSysrq m, w, and t output every 
15 seconds, as well as sar -q data (which helps pinpoint when the 
event occurs).  We've had several occurrences of the bug today for 
which I've got data, so I'll attach those files separately.  Perhaps by
comparing the data from before and after, you can tell what might be 
happening in the interim?  Because there's nothing we can do to see 
that--the system is completely unresponsive, as I said.

If you need us to be collecting other AltSysrq info, let me know.
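The monitoring described above works by writing the SysRq command letters to
/proc/sysrq-trigger on a schedule; the kernel dumps the requested state into
the kernel ring buffer (and from there into syslog).  A minimal sketch of such
a collector, assuming the reporter's setup (the function name, log path, and
the parameterized trigger path are illustrative, not taken from this report):

```shell
#!/bin/sh
# Hypothetical AltSysrq collector sketch.  SYSRQ_TRIGGER and LOG are
# parameters with sensible defaults; override them to exercise the script
# against scratch files.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
LOG="${LOG:-/var/log/sysrq-collector.log}"

collect_once() {
    # Each write asks the kernel to dump state into the kernel ring buffer:
    # m = memory info, w = blocked (uninterruptible) tasks, t = all tasks.
    for key in m w t; do
        echo "$key" > "$SYSRQ_TRIGGER"
    done
    # Timestamp each collection pass so the dumps can be correlated
    # with the sar -q data later.
    date '+%Y-%m-%d %H:%M:%S' >> "$LOG"
}
```

In production this would run as root in a loop, e.g.
`while :; do collect_once; sleep 15; done`; the SysRq output itself lands in
the kernel log, not in $LOG, which only records when each pass ran.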
Comment 1 John Caruso 2004-11-12 18:20:31 EST
Created attachment 106619 [details]
12:28 sar -q / AltSysrq output

The timing on this one appeared to be between 12:28:05 and 12:29:15.
Comment 2 John Caruso 2004-11-12 18:23:22 EST
Created attachment 106620 [details]
14:22 sar -q/AltSysrq output

This event happened somewhere between 14:22:15 and 14:23:09.
Comment 3 John Caruso 2004-11-12 18:26:55 EST
Created attachment 106621 [details]
15:05 sar -q / AltSysrq output

This event started between 15:05:17 and 15:05:51.  Also, this one is a bit more
typical than the other two (i.e., more like the ones we typically see).
Comment 4 Larry Woodman 2004-11-13 08:32:13 EST
You need to run the latest RHEL3-U4 kernel; I fixed this problem of
blocking inside of wakeup_kswapd, which is scattered throughout these
tracebacks.

-----------------------------------------------------------------------
SqlnetAgent   D 00000006     0 12851  12766               12850 (NOTLB)
Call Trace:   [<021246c7>] schedule [kernel] 0x2f7 (0x9e6e7de0)
[<02156d69>] wakeup_kswapd [kernel] 0xe9 (0x9e6e7e24)
[<02158c10>] __alloc_pages [kernel] 0xf0 (0x9e6e7e78)
-----------------------------------------------------------------------

This kernel is located here:

>>>http://people.redhat.com/~lwoodman/RHEL3/


Larry Woodman
Comment 5 John Caruso 2004-11-13 13:57:40 EST
Thanks.  Do you have a hugemem version?  That's what I need for these 
systems.

Also, was there a bug filed already for this problem?  I couldn't 
find one when I searched.  Maybe you just noticed it and fixed it in 
the course of other bug hunts?
Comment 6 Ernie Petrides 2004-11-15 15:18:20 EST
Hi, John.  Thanks for your bug report.  Larry found this problem
while investigating several other bugzillas, but to the best of our
knowledge, no other bugzilla was specifically resolved by Larry's fix.

The fix was committed to the RHEL3 U4 patch pool on 14-Sep-2004 (in
kernel version 2.4.21-20.6.EL).  The latest U4 kernel in the RHN beta
channel right now is 2.4.21-23.EL, although the most recently built U4
kernel, which is 2.4.21-25.EL, is undergoing internal Q/A and will
appear in the RHN beta channel soon.  The latter will most likely be
released in a couple of weeks as the official U4 kernel.

If you would like to test the -23.EL hugemem kernel (in a non-production
environment), please fetch it from the RHN beta channel.  However, if
you want to put a fixed U4 kernel into production use, you should wait
until the -25.EL kernels are available in the main RHEL3 RHN channel.
Comment 7 John Caruso 2004-11-15 15:25:59 EST
Ernie: Larry already made the SMP version of the -25.EL kernel 
available; I'm just asking for the hugemem version.  I can't 
emphasize enough how important this is to us: we are experiencing 
multiple short production downtimes due to the problems with the -20.EL
kernel, and we need a more stable kernel as soon as possible.

I've already tested the -23.EL kernel, and it's toxic--it crashed a 
system within 10 hours of installation, and others have reported the 
same issue (see notes 110 through 112 in bug 132630).  It is not an 
option.  We're willing to test with the -25.EL kernel, though, and 
I'm already doing so with the SMP version that Larry provided--but we 
need the hugemem version to test on the machine where we're actually 
seeing this problem.  Thus my request that Larry make that one 
available, as he did with the SMP version.
Comment 8 John Caruso 2004-11-15 15:27:06 EST
Sorry, that should have been bug 132639 (not 132630).
Comment 9 Ernie Petrides 2004-11-15 15:48:01 EST
Okay.  Larry, could you please provide the -25.EL hugemem kernel to John?

(But please leave this bug in MODIFIED state.)
Comment 10 Larry Woodman 2004-11-15 16:29:08 EST
John, I just moved the hugemem kernel here:

>>>http://people.redhat.com/~lwoodman/RHEL3/


Larry
Comment 11 John Caruso 2004-11-15 16:52:58 EST
Thanks, Larry.  I've got it installed on two of our systems now, and 
assuming that it seems stable over the next week we'll probably try 
it out on the server where we actually saw the hangs.
Comment 12 Ernie Petrides 2004-11-17 18:51:28 EST
John, for future reference, the 2.4.21-25.EL kernels were pushed into
the RHN beta channel earlier today.  Thanks for your help with testing.
Comment 13 John Caruso 2004-11-18 14:47:53 EST
Thanks.  I see that the MD5 sums for those files are different from 
the ones Larry provided...are there any substantive differences?  I'm 
basically concerned as to whether or not I should swap out 
the "official" beta kernels for the ones Larry provided.

BTW, we've not seen any misbehavior from the -25.EL kernel yet (but 
the problem described in this bug report only occurs on our 
production database server, and we're still testing -25.EL before we 
push it to that server, so I can't say that the bug here is actually 
resolved for us).
Comment 14 Larry Woodman 2004-11-18 15:16:52 EST
John, the 2.4.21-25.EL kernels I made available to you were copied
from the same location as the ones available on the RHN beta channel.

Larry
Comment 15 John Caruso 2004-11-18 15:23:00 EST
That doesn't really answer my question, though.  The kernels you 
provided have the following MD5 checksums:

a3abdb0547e252c11c0d858abc66c507  kernel-hugemem-2.4.21-25.EL.i686.rpm
ca52e1bc9adf56598165619ec516974f  kernel-smp-2.4.21-25.EL.i686.rpm

But the same kernels from the beta channel have the following 
checksums:

1b6cc68ce2a98567533eaf7ff181fed7  kernel-hugemem-2.4.21-25.EL.i686.rpm
ca30a7839c4f81fbdd65472e2e2b7c65  kernel-smp-2.4.21-25.EL.i686.rpm

And the files have different sizes as well.  They're not the same.  I 
just need to know whether or not there are any concrete differences 
between these two sets of kernels--e.g. maybe some additional patches 
are included in the beta versions--so that I know if I should replace 
the ones you provided with the beta kernels.
Comment 16 Ernie Petrides 2004-11-18 16:23:36 EST
John, I've just tracked down the difference between the RPMs that
Larry provided (which were from my "official build" of 2.4.21-25.EL)
and the ones that are in the RHN beta channel.  Whenever RPMs are
released on RHN, even those pushed into the beta channel, they are
"signed" (which facilitates an authentication mechanism).  This act
of signing actually changes metadata in the RPM file itself, although
the vital "contents" (or "payload") is not altered.

The command for verifying the md5sum of the payload is as follows:

    rpm -qp --qf "%{SIGMD5}\n" <rpm-name>

I have just verified that RPMs in both the relevant directory used by
RHN and the one where the originally built RPMs (which Larry used) reside
have identical payload md5sums:

    812ab05a1c41ee7cd6a13c7d957561ee    kernel-smp-2.4.21-25.EL.i686.rpm
    d771ed6939cf05b73fbc4d90d8d9740d    kernel-hugemem-2.4.21-25.EL.i686.rpm

In general, when a kernel becomes available in the RHN beta channel, which
just happened yesterday, it is better to use the RPMs from there (rather
than the RPMs from someone's "people" page).  But in this case, it was
advantageous to use what Larry made available for you so that you could
get them earlier.

When the U4 kernels are finally released in the main RHN channel, they are
signed with a different key.  So, besides the fact that we did one more
respin (for -26.EL) to resolve a minor security issue, the RPMs would have
a different external md5sum in the main RHN channel (from the beta channel).

I strongly recommend that you install/update production servers from the
main RHN channel when U4 is released.  In the meantime, it would be better
to upgrade systems through the RHN beta channel (simply from a support
perspective).
Comment 17 John Caruso 2004-11-18 20:11:04 EST
Great, thanks for the explanation.  I verified that the payload 
checksums are identical on the ones I have.  Just checking to make 
sure we weren't missing anything important....

We'll be putting the -25.EL kernel into production this Saturday, and 
I'll let you know if there are any problems or if it doesn't resolve 
the pausing issue (so far we haven't seen any repeats of the hanging 
problem we saw with -23.EL, happily).
Comment 18 John Flanagan 2004-12-20 15:56:57 EST
An erratum has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html
Comment 19 John Caruso 2005-07-01 13:47:25 EDT
After months of debugging kernel corruption issues, we're now in a position to 
look at other/old problems, and I can tell you that this particular bug appears
not to have been resolved by the erratum mentioned.  We're currently running the
2.4.21-32.4.ELsmp kernel on our production database servers and we've been 
seeing these same 15-45 second hangs several times over the past few weeks--the 
behavior and signatures of the events are all the same as they were when I 
originally reported this bug.

We no longer have the AltSysrq -m/-w/-t output collector set up.  Please let me 
know if you'll need that (and if so, specifically which ones you need, since 
they generate enormous amounts of output in various logs).
Comment 20 John Caruso 2005-07-05 14:29:43 EDT
I set up the AltSysrq -m/-w/-t collector again, and unfortunately it produced 
extreme system instability, resulting in a system hang (among other things).  
So we won't be able to collect that information.
Comment 21 Larry Woodman 2005-07-06 11:18:13 EDT
John, the AltSysrq M, T and W don't work for you when your system pauses for
several seconds?  What happens, a panic or a hang?

Larry Woodman
Comment 22 John Caruso 2005-07-08 12:23:16 EDT
No, the AltSysrq problems are happening independently of the pauses.  We put the
AltSysrq data collection back in place in order to see what was happening during
the pauses (in the hopes that you could find a smoking gun as you did in comment
4), but when we did so, the system hung--not just for 30 seconds or so, but a
hard hang, requiring a powercycle to recover.  When I started up the AltSysrq
data collection on a subsequent reboot, it caused instability in Veritas Cluster
Server which caused a *companion* system (clustered with the first one) to
crash.  I don't have a memory dump from that crash, unfortunately, but I do have
some data from netdump, which I'll attach to this case; let me know if I should
file it as a separate bug.

It's possible that it's netdump that's causing AltSysrq to behave so badly,
since 1) we weren't running netdump when we originally collected the AltSysrq
data for you for this case and 2) I see that each AltSysrq run was producing an
output directory on the netdump server.  But no matter what it is, we can't have
that kind of instability on our production servers, so we can't collect AltSysrq
data.  So our ability to give you debugging info for this case is unfortunately
limited, now.
Comment 23 John Caruso 2005-07-08 12:30:44 EDT
Created attachment 116521 [details]
netdump output from a system crash (7/2/2005, 10:23am)

The crash that produced this output was pretty strange: running AltSysrq on one
server that's clustered with another server via Veritas Cluster Server (VCS)
caused the *second* server to crash.  Since VCS is the only connection between
the two systems, the crash must have occurred due to action on the part of VCS.
This may just reflect a bug in VCS, of course, but the output actually says
"kernel BUG at slab.c:1143!", so I thought it would be worth sharing with you. 
If you think it's a VCS bug, let me know and I'll take it up with Veritas.

What's not in doubt, though, is that running the AltSysrq collection script on
the other server led to this crash...so whether or not it represents a bug in
VCS, it means that collecting AltSysrq data is dangerous for sites that use
VCS.
Comment 24 John Caruso 2005-07-08 12:33:23 EDT
FYI: "gab" (from the traceback) is global atomic broadcast--a protocol used by
VCS to keep the member systems of a cluster synchronized.
Comment 25 Ernie Petrides 2005-07-21 17:13:12 EDT
Larry, is your work on bug 161957 related to this?  Should John try
the U6 beta candidate kernel with /proc/sys/vm/kscand_work_percent
set to a low value to break up kscand activity?
Comment 26 Larry Woodman 2005-08-15 16:50:35 EDT
Ernie, the patch tracking file for this bug is
1036.lwoodman.kscand-work-percnt.patch.  

John, can you please grab the RHEL3-U6 beta kernel and test it out after setting
/proc/sys/vm/kscand_work_percent to 10?  This tunable defaults to 100, but if
one insists on running large Oracle systems without hugepages, it should be
lowered to 10 or so.  This prevents kscand from holding the zone lru list lock
for long periods, thereby allowing other processes to get the lock and run.

Larry Woodman
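The workaround above amounts to a one-line runtime change.  A minimal sketch
(the tunable path is parameterized here only so the snippet can be exercised
without a RHEL3 U6 kernel; the helper name is an illustration, and the setting
does not survive a reboot):

```shell
# Tunable path; on a real RHEL3 U6+ system this is the /proc file itself.
# It does not exist on earlier kernels, so check for it first.
KSCAND="${KSCAND:-/proc/sys/vm/kscand_work_percent}"

# Write a new percentage (takes effect immediately; root required on /proc).
set_kscand_work_percent() {
    echo "$1" > "$KSCAND"
}

# As root on the affected server:
#   set_kscand_work_percent 10
#   cat /proc/sys/vm/kscand_work_percent
```

To re-apply the value at boot, a line in /etc/rc.local would work, or
(assuming the sysctl name mirrors the /proc path, which is the usual
convention) `vm.kscand_work_percent = 10` in /etc/sysctl.conf.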
Comment 27 John Caruso 2005-08-19 16:59:36 EDT
We're about to swap out the database servers where we've seen this issue for
64-bit servers, so unfortunately I won't be able to test this fix for you.

Regarding "insists on": we'd love to have used hugepages, and all the other
RHEL/Oracle large memory features...but when we tried them out, each one of them
caused hangs, crashes, memory corruption, etc.  None of them were safe in a
production environment.  That's one of the main reasons that we're going to
64-bit RHEL--hopefully it will be more stable than 32-bit RHEL in large memory
Oracle configurations.
Comment 28 Ernie Petrides 2005-08-19 18:35:14 EDT
Fair enough, John.  Assuming you're moving to x86_64 systems that aren't
NUMA, you probably want to add "numa=off" to your kernel boot options.

We believe that this bugzilla is effectively a dup of bug 145950, and
that using the new tunable as described in comment #26 above sufficiently
addresses the issue.  There is also more info in bug 161957 comment #21.

Support for the "kscand_work_percent" tunable was committed to the
RHEL3 U6 patch pool on 15-Jul-2005 (in kernel version 2.4.21-32-12.EL).

Propagating acks from bug 145950 and moving to MODIFIED state.
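The "numa=off" suggestion from comment 28 goes on the kernel command line.  An
illustrative GRUB legacy entry for a RHEL3 x86_64 system (the kernel version,
device names, and root label below are assumptions, not taken from this
report; only the numa=off parameter is the point):

```
# /boot/grub/grub.conf -- append numa=off to the kernel line
title Red Hat Enterprise Linux 3 (2.4.21-32.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.4.21-32.ELsmp ro root=LABEL=/ numa=off
        initrd /initrd-2.4.21-32.ELsmp.img
```

GRUB legacy reads grub.conf directly at boot, so the change takes effect on
the next reboot with no reinstall step.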
Comment 30 John Caruso 2005-09-02 20:55:57 EDT
We believe we saw a pause today like the pauses described in this bug, but on a
RHEL3 x86_64 server running kernel 2.4.21-32.ELsmp (but it's difficult to be
sure because of the here-now-gone-a-few-seconds-later nature of the bug).  Do
you know if this bug also exists on that platform?

As far as we can tell, the pause seemed to happen between 10:58:00 and 11:00:01.
Here's sar -B output from around that time:

10:58:00 AM  pgpgin/s pgpgout/s   fault/s  majflt/s
10:58:00 AM      0.07   1741.79      0.00      0.00
10:59:00 AM      0.00     27.58      0.00      0.00
11:00:00 AM      0.00      7.73      0.00      0.00
11:01:00 AM      0.13   3474.77      0.00      0.00
11:02:01 AM      0.13   5112.37      0.00      0.00
11:03:00 AM      0.14   5080.62      0.00      0.00
11:04:00 AM      0.13   3623.92      0.00      0.00

Would this perhaps be consistent with kscand kicking in during the target times,
and then something else (kswapd?) coming by a bit later to page out those pages...?

Also, the kscand_work_percent tunable doesn't exist on RHEL3 x86_64, so it
wasn't possible to try the workaround you suggested in comment 26.  Is there
some other tunable that performs that function on RHEL3 x86_64, or perhaps a
different tunable that would have the same effect?
Comment 31 Ernie Petrides 2005-09-02 21:06:20 EDT
Hi, John.  I'll let Larry follow up on Tuesday, but I can confirm that
the kscand_work_percent tunable is architecture-independent.  It should
exist in the U6 beta kernels (whereas -32.EL is U5).
Comment 32 John Caruso 2005-09-02 21:23:22 EDT
Thanks, Ernie--in my quick rescan of the bug I didn't get that
kscand_work_percent was actually introduced in the U6 kernel.  I'll see if we
can give that a test....
Comment 34 Eric Proust 2005-09-16 07:56:49 EDT
Hello,

We have one system running kernel 2.4.21-32.0.1.ELhugemem + Veritas VCS and
another running 2.4.21-15.ELsmp.  An Oracle database runs on them, and we have
the following problems:
1) Boot time is over 30 minutes, because the VM seems to check its disks.
2) The system seems to hang for about 10 seconds, quite frequently.
3) Approximately once a week we have a "VM crash": vxio logs a memory problem
and then uncorrectable write errors (see below).  The only solution we have in
this case is to reboot the server.

We've also noticed that our servers are swapping a lot.

Our hardware: IBM x445 servers (8 GB RAM); disks are on a SAN (IBM DS4300).

Do you see these types of symptoms?


Sep 11 04:02:57 tahiti kernel: VxVM vxio V-5-0-474 Cannot allocate mem of size
32768 bytes1
Sep 11 04:02:58 tahiti kernel: VxVM vxio V-5-0-2 Volume MSoracle block 8937824:
Uncorrectable write error
Sep 11 04:02:58 tahiti kernel: VxVM vxio V-5-0-2 Volume MSoracle block 1093296:
Uncorrectable write error
Sep 11 04:02:58 tahiti kernel: VxVM vxio V-5-0-2 Volume MSoracle block 985424:
Uncorrectable write error
Comment 35 John Caruso 2005-09-16 15:15:32 EDT
Symptom 2 looks like this same issue, maybe, but we don't use VxVM so we haven't
seen symptom 1.  Symptom 3 looks it could be caused by bug 141394, which we
filed (and which Redhat has resolved), though it's hard to say since that bug's
behavior was so unpredictable.
Comment 36 Red Hat Bugzilla 2005-09-28 10:33:51 EDT
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-663.html
