Bug 139113
Summary: System hangs for 15-45 seconds on RHEL3 / kernel 2.4.21-20.EL

Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: John Caruso <jcaruso>
Assignee: Larry Woodman <lwoodman>
CC: dff, lwang, peterm, petrides, riel, tkincaid
Fixed In Version: RHSA-2005-663
Doc Type: Bug Fix
Last Closed: 2005-09-28 14:33:50 UTC
Bug Blocks: 156320
Description
John Caruso
2004-11-12 23:13:32 UTC
Created attachment 106619 [details]
12:28 sar -q / AltSysrq output
The timing on this one appeared to be between 12:28:05 and 12:29:15.
Created attachment 106620 [details]
14:22 sar -q/AltSysrq output
This event happened somewhere between 14:22:15 and 14:23:09.
Created attachment 106621 [details]
15:05 sar -q / AltSysrq output
This event started between 15:05:17 and 15:05:51. Also, this one is a bit more
typical than the other two (i.e., more like the ones we typically see).
You need to run the latest RHEL3-U4 kernel; I fixed this problem of
blocking inside wakeup_kswapd, which is scattered throughout these
tracebacks.
-----------------------------------------------------------------------
SqlnetAgent D 00000006 0 12851 12766 12850 (NOTLB)
Call Trace: [<021246c7>] schedule [kernel] 0x2f7 (0x9e6e7de0)
[<02156d69>] wakeup_kswapd [kernel] 0xe9 (0x9e6e7e24)
[<02158c10>] __alloc_pages [kernel] 0xf0 (0x9e6e7e78)
-----------------------------------------------------------------------
This kernel is located here:
>>>http://people.redhat.com/~lwoodman/RHEL3/
Larry Woodman
Thanks. Do you have a hugemem version? That's what I need for these systems. Also, was there a bug filed already for this problem? I couldn't find one when I searched. Maybe you just noticed it and fixed it in the course of other bug hunts?

Hi, John. Thanks for your bug report. Larry found this problem while investigating several other bugzillas, but to the best of our knowledge, no other bugzilla was specifically resolved by Larry's fix. The fix was committed to the RHEL3 U4 patch pool on 14-Sep-2004 (in kernel version 2.4.21-20.6.EL). The latest U4 kernel in the RHN beta channel right now is 2.4.21-23.EL, although the most recently built U4 kernel, 2.4.21-25.EL, is undergoing internal Q/A and will appear in the RHN beta channel soon. The latter will most likely be released in a couple of weeks as the official U4 kernel. If you would like to test the -23.EL hugemem kernel (in a non-production environment), please fetch it from the RHN beta channel. However, if you want to put a fixed U4 kernel into production use, you should wait until the -25.EL kernels are available in the main RHEL3 RHN channel.

Ernie: Larry already made the SMP version of the -25.EL kernel available; I'm just asking for the hugemem version. I can't emphasize enough how important this is to us: we are experiencing multiple short production downtimes due to the problems with the -20.EL kernel, and we need a more stable kernel as soon as possible. I've already tested the -23.EL kernel, and it's toxic--it crashed a system within 10 hours of installation, and others have reported the same issue (see notes 110 through 112 in bug 132630). It is not an option. We're willing to test with the -25.EL kernel, though, and I'm already doing so with the SMP version that Larry provided--but we need the hugemem version to test on the machine where we're actually seeing this problem. Thus my request that Larry make that one available, as he did with the SMP version.
Sorry, that should have been bug 132639 (not 132630).

Okay. Larry, could you please provide the -25.EL hugemem kernel to John? (But please leave this bug in MODIFIED state.)
John, I just moved the hugemem kernel here:
>>>http://people.redhat.com/~lwoodman/RHEL3/
Larry
Thanks, Larry. I've got it installed on two of our systems now, and assuming that it seems stable over the next week we'll probably try it out on the server where we actually saw the hangs.

John, for future reference, the 2.4.21-25.EL kernels were pushed into the RHN beta channel earlier today. Thanks for your help with testing.

Thanks. I see that the MD5 sums for those files are different from the ones Larry provided...are there any substantive differences? I'm basically concerned as to whether or not I should swap out the "official" beta kernels for the ones Larry provided. BTW, we've not seen any misbehavior from the -25.EL kernel yet (but the problem described in this bug report only occurs on our production database server, and we're still testing -25.EL before we push it to that server, so I can't say that the bug here is actually resolved for us).

John, the 2.4.21-25.EL kernels I made available to you were copied from the same location as the ones available on the RHN beta channel.
Larry

That doesn't really answer my question, though. The kernels you provided have the following MD5 checksums:

a3abdb0547e252c11c0d858abc66c507  kernel-hugemem-2.4.21-25.EL.i686.rpm
ca52e1bc9adf56598165619ec516974f  kernel-smp-2.4.21-25.EL.i686.rpm

But the same kernels from the beta channel have the following checksums:

1b6cc68ce2a98567533eaf7ff181fed7  kernel-hugemem-2.4.21-25.EL.i686.rpm
ca30a7839c4f81fbdd65472e2e2b7c65  kernel-smp-2.4.21-25.EL.i686.rpm

And the files have different sizes as well. They're not the same. I just need to know whether there are any concrete differences between these two sets of kernels--e.g., maybe some additional patches are included in the beta versions--so that I know if I should replace the ones you provided with the beta kernels.

John, I've just tracked down the difference between the RPMs that Larry provided (which were from my "official build" of 2.4.21-25.EL) and the ones that are in the RHN beta channel.
Whenever RPMs are released on RHN, even those pushed into the beta channel, they are "signed" (which facilitates an authentication mechanism). This act of signing actually changes metadata in the RPM file itself, although the vital "contents" (or "payload") are not altered. The command for verifying an md5sum on the payload is as follows:

rpm -pq --qf "%{SIGMD5}\n" <rpm-name>

I have just verified that the RPMs in both the relevant directory used by RHN and the one where the originally built RPMs (which Larry used) reside have identical payload md5sums:

812ab05a1c41ee7cd6a13c7d957561ee  kernel-smp-2.4.21-25.EL.i686.rpm
d771ed6939cf05b73fbc4d90d8d9740d  kernel-hugemem-2.4.21-25.EL.i686.rpm

In general, when a kernel becomes available in the RHN beta channel, which just happened yesterday, it is better to use the RPMs from there (rather than the RPMs from someone's "people" page). But in this case, it was advantageous to use what Larry made available for you so that you could get them earlier. When the U4 kernels are finally released in the main RHN channel, they are signed with a different key. So, besides the fact that we did one more respin (for -26.EL) to resolve a minor security issue, the RPMs would have a different external md5sum in the main RHN channel (from the beta channel). I strongly recommend that you install/update production servers from the main RHN channel when U4 is released. In the meantime, it would be better to upgrade systems through the RHN beta channel (simply from a support perspective).

Great, thanks for the explanation. I verified that the payload checksums are identical on the ones I have. Just checking to make sure we weren't missing anything important.... We'll be putting the -25.EL kernel into production this Saturday, and I'll let you know if there are any problems or if it doesn't resolve the pausing issue (so far we haven't seen any repeats of the hanging problem we saw with -23.EL, happily).
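The payload-digest comparison described in the comments above can be sketched as a small shell snippet. This is only a sketch: the file names are the ones from this report, and the script assumes you have both copies of the RPMs (signed and unsigned) on hand; on a machine without them it simply reports that it skipped the check.

```shell
# Print the SIGMD5 payload digest of an RPM. SIGMD5 covers only the
# payload, so a signed and an unsigned copy of the same build compare
# equal even though their whole-file md5sums differ.
payload_md5() {
    rpm -qp --qf '%{SIGMD5}\n' "$1" 2>/dev/null
}

# File names taken from this report; adjust paths to where your copies live.
for f in kernel-smp-2.4.21-25.EL.i686.rpm kernel-hugemem-2.4.21-25.EL.i686.rpm; do
    if command -v rpm >/dev/null 2>&1 && [ -f "$f" ]; then
        echo "$f payload md5: $(payload_md5 "$f")"
    else
        echo "$f: skipped (rpm tool or file not present)"
    fi
done
```

Run it once against the RPMs from Larry's people page and once against the beta-channel copies; matching digests mean the payloads are byte-identical and only the signature metadata differs.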
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html

After months of debugging kernel corruption issues, we're now in a position to look at other/old problems, and I can tell you that this particular bug appears not to have been resolved by the errata mentioned. We're currently running the 2.4.21-32.4.ELsmp kernel on our production database servers and we've been seeing these same 15-45 second hangs several times over the past few weeks--the behavior and signatures of the events are all the same as they were when I originally reported this bug. We no longer have the AltSysrq -m/-w/-t output collector set up. Please let me know if you'll need that (and if so, specifically which ones you need, since they generate enormous amounts of output in various logs).

I set up the AltSysrq -m/-w/-t collector again, and unfortunately it produced extreme system instability, resulting in a system hang (among other things). So we won't be able to collect that information.

John, don't AltSysrq M, T and W work for you when your system pauses for several seconds? What happens, a panic or a hang?
Larry Woodman

No, the AltSysrq problems are happening independently of the pauses. We put the AltSysrq data collection back in place in order to see what was happening during the pauses (in the hopes that you could find a smoking gun as you did in comment 4), but when we did so, the system hung--not just for 30 seconds or so, but a hard hang, requiring a powercycle to recover. When I started up the AltSysrq data collection on a subsequent reboot, it caused instability in Veritas Cluster Server which caused a *companion* system (clustered with the first one) to crash.
I don't have a memory dump from that crash, unfortunately, but I do have some data from netdump, which I'll attach to this case; let me know if I should file it as a separate bug. It's possible that it's netdump that's causing AltSysrq to behave so badly, since 1) we weren't running netdump when we originally collected the AltSysrq data for you for this case and 2) I see that each AltSysrq run was producing an output directory on the netdump server. But no matter what it is, we can't have that kind of instability on our production servers, so we can't collect AltSysrq data. So our ability to give you debugging info for this case is unfortunately limited now.

Created attachment 116521 [details]
netdump output from a system crash (7/2/2005, 10:23am)
The crash that produced this output was pretty strange: running AltSysrq on one
server that's clustered with another server via Veritas Cluster Server (VCS)
caused the *second* server to crash. Since VCS is the only connection between
the two systems, the crash must have occurred due to action on the part of VCS.
This may just reflect a bug in VCS, of course, but the output actually says
"kernel BUG at slab.c:1143!", so I thought it would be worth sharing with you.
If you think it's a VCS bug, let me know and I'll take it up with Veritas.
What's not in doubt, though, is that running the AltSysrq collection script on
the other server led to this crash...so whether or not it represents a bug in
VCS, it means that collecting AltSysrq data is dangerous for sites that use
VCS.
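For reference, an AltSysrq m/w/t collector of the kind discussed in these comments can be driven from a script via /proc/sysrq-trigger. The sketch below is an assumption (the actual collection script used in this report is not shown), and it defaults to a dry run that only prints the commands, since triggering SysRq writes directly into the kernel log of the live host.

```shell
# Hypothetical AltSysrq m/w/t collector sketch. With DRY_RUN=1 (the
# default) it only echoes the commands it would execute; set DRY_RUN=0
# on a test box (as root) to actually trigger the dumps.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        sh -c "$*"
    fi
}

run 'echo 1 > /proc/sys/kernel/sysrq'      # enable all SysRq functions
for key in m w t; do                       # m=memory, w=blocked tasks, t=all tasks
    run "echo $key > /proc/sysrq-trigger"
done
```

The resulting dumps land in the kernel log (dmesg/syslog), which is why the report mentions the enormous amount of output these runs generate.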
FYI: "gab" (from the traceback) is global atomic broadcast--a protocol used by VCS to keep the member systems of a cluster synchronized.

Larry, is your work on bug 161957 related to this? Should John try the U6 beta candidate kernel with /proc/sys/vm/kscand_work_percent set to a low value to break up kscand activity?

Ernie, the patch tracking file for this bug is 1036.lwoodman.kscand-work-percnt.patch. John, can you please grab the RHEL3-U6 beta kernel and test it out after setting /proc/sys/vm/kscand_work_percent to 10? This tunable defaults to 100, but if one insists on running large Oracle systems without hugepages, it should be lowered to 10 or so. This will prevent kscand from holding the zone lru list lock for very long amounts of time, thereby allowing other processes to get the lock and run.
Larry Woodman

We're about to swap out the database servers where we've seen this issue for 64-bit servers, so unfortunately I won't be able to test this fix for you. Regarding "insists on": we'd love to have used hugepages, and all the other RHEL/Oracle large memory features...but when we tried them out, each one of them caused hangs, crashes, memory corruption, etc. None of them were safe in a production environment. That's one of the main reasons that we're going to 64-bit RHEL--hopefully it will be more stable than 32-bit RHEL in large memory Oracle configurations.

Fair enough, John. Assuming you're moving to x86_64 systems that aren't NUMA, you probably want to add "numa=off" to your kernel boot options.

We believe that this bugzilla is effectively a dup of bug 145950, and that using the new tunable as described in comment #26 above sufficiently addresses the issue. There is also more info in bug 161957 comment #21. Support for the "kscand_work_percent" tunable was committed to the RHEL3 U6 patch pool on 15-Jul-2005 (in kernel version 2.4.21-32-12.EL). Propagating acks from bug 145950 and moving to MODIFIED state.
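The tuning step suggested above can be sketched as a small shell helper. This is a sketch under the assumption that you are on a RHEL3 U6 (or later) kernel that provides the tunable; on kernels without it, the helper just reports that and does nothing.

```shell
# Lower kscand's per-pass workload, as suggested in the comments above.
# /proc/sys/vm/kscand_work_percent exists only on kernels carrying the
# U6 kscand patch; the helper checks for it before writing.
set_kscand_work_percent() {
    tunable=/proc/sys/vm/kscand_work_percent
    if [ -w "$tunable" ]; then
        # Default is 100; 10 shortens each kscand pass so other
        # processes can take the zone lru list lock in between.
        echo "$1" > "$tunable"
        echo "kscand_work_percent is now $(cat "$tunable")"
    else
        echo "kscand_work_percent not available on this kernel"
    fi
}

set_kscand_work_percent 10
```

Note that writes to /proc/sys/vm do not survive a reboot; to make the setting persistent you would add the equivalent vm.kscand_work_percent line to /etc/sysctl.conf.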
We believe we saw a pause today like the pauses described in this bug, but on a RHEL3 x86_64 server running kernel 2.4.21-32.ELsmp (it's difficult to be sure because of the here-now-gone-a-few-seconds-later nature of the bug). Do you know if this bug also exists on that platform? As far as we can tell, the pause seemed to happen between 10:58:00 and 11:00:01. Here's sar -B output from around that time:

10:58:00 AM  pgpgin/s  pgpgout/s   fault/s  majflt/s
10:58:00 AM      0.07    1741.79      0.00      0.00
10:59:00 AM      0.00      27.58      0.00      0.00
11:00:00 AM      0.00       7.73      0.00      0.00
11:01:00 AM      0.13    3474.77      0.00      0.00
11:02:01 AM      0.13    5112.37      0.00      0.00
11:03:00 AM      0.14    5080.62      0.00      0.00
11:04:00 AM      0.13    3623.92      0.00      0.00

Would this perhaps be consistent with kscand kicking in during the target times, and then something else (kswapd?) coming by a bit later to page out those pages...? Also, the kscand_work_percent tunable doesn't exist on RHEL3 x86_64, so it wasn't possible to try the workaround you suggested in comment 26. Is there some other tunable that performs that function on RHEL3 x86_64, or perhaps a different tunable that would have the same effect?

Hi, John. I'll let Larry follow up on Tuesday, but I can confirm that the kscand_work_percent tunable is architecture-independent. It should exist in the U6 beta kernels (whereas -32.EL is U5).

Thanks, Ernie--in my quick rescan of the bug I didn't get that kscand_work_percent was actually introduced in the U6 kernel. I'll see if we can give that a test....

Hello, we have a system with RH 2.4.21-32.0.1.ELhugemem + Veritas VCS and another with 2.4.21-15.ELsmp. An Oracle database is running on it and we have the following problems:

1) boot time > 30 min, because VM seems to check its disks
2) the system seems to hang for 10s a lot of the time
3) approx. once a week, we have a "VM crash": vxio logs a memory problem, then an uncorrectable write error (see below).
The only solution we have in this case is to reboot the server. We've also detected that our servers are swapping a lot. Our hardware is: IBM X445 servers (8 GB RAM); disks are on a SAN (IBM DS4300). Do you have these types of symptoms?

Sep 11 04:02:57 tahiti kernel: VxVM vxio V-5-0-474 Cannot allocate mem of size 32768 bytes1
Sep 11 04:02:58 tahiti kernel: VxVM vxio V-5-0-2 Volume MSoracle block 8937824: Uncorrectable write error
Sep 11 04:02:58 tahiti kernel: VxVM vxio V-5-0-2 Volume MSoracle block 1093296: Uncorrectable write error
Sep 11 04:02:58 tahiti kernel: VxVM vxio V-5-0-2 Volume MSoracle block 985424: Uncorrectable write error

Symptom 2 looks like this same issue, maybe, but we don't use VxVM so we haven't seen symptom 1. Symptom 3 looks like it could be caused by bug 141394, which we filed (and which Red Hat has resolved), though it's hard to say since that bug's behavior was so unpredictable.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-663.html