Bug 117460
Summary:      SSH fails in various contexts with ENOBUFS error
Product:      Red Hat Enterprise Linux 2.1
Component:    kernel
Version:      2.1
Hardware:     i686
OS:           Linux
Status:       CLOSED ERRATA
Severity:     high
Priority:     medium
Reporter:     John Caruso <jcaruso>
Assignee:     Jason Baron <jbaron>
CC:           alan, fenlason, knoel, mgalgoci, riel
URL:          http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=75108
Doc Type:     Bug Fix
Bug Blocks:   116726
Last Closed:  2004-08-18 14:25:08 UTC
Description (John Caruso, 2004-03-04 08:57:06 UTC):
2.4.9-34enterprise is not an AS2.1 kernel. AS2.1 uses a different versioning scheme with independent numbers; 2.4.9-e.38 is well over two years newer than 2.4.9-34enterprise.

Ok, thanks for the clarification on that point--I was misled by the fact that "34" was about the correct release level at that time for RHAS2.1, and by the word "enterprise" in the kernel name. Regardless, we are in fact seeing all of the same symptoms as those in bug 75108, but on a system running the latest AS2.1 errata kernel release (kernel-smp-2.4.9-e.38). It seems likely that this is the same bug manifesting itself on a similar kernel release. If the knowledge that the symptoms are the same as in bug 75108 and the apparent fact that this is a known bug (with a known resolution) are not sufficient to address this bug on AS2.1, I can try to capture whatever information (strace, tcpdump, etc.) you may need to troubleshoot it.

What's up with this bug? It appears that the exchange above led to it falling through the cracks. It's a live bug.

Larry, assuming this really is your bug at this point: what you said about buffermem in bug 117902 would also seem to apply to this bug (I'd always assumed they were related anyway, since both bugs show up at the same time when a system is low on memory). Since Arjan claimed that the ENOBUFS/SSH bug was fixed in a later version of the kernel, it seems likely that that fix could just be backported to RHAS 2.1--and it also seems likely that the fix is the same as the fix for bug 117902 (or is closely related).

John, can you try out the same kernel that I posted in bug 117902 and see if it fixes this problem as well? It appears that the ENOBUFS error is also a side effect of running out of lowmem. Larry

We have a system that has gotten the ENOBUFS error fairly regularly in the past with an hourly rsync cron job--so much so that I'd filtered that output so I wouldn't see it. I've removed that filter and I'm now waiting for the system to give me the error (which manifests itself as "writefd_unbuffered failed to write X bytes" with rsync); once it does, I'll try updating to your new kernel and see if that fixes it. Unfortunately it may be tough to give a definite positive answer, because the problem itself doesn't happen predictably--I've had the filter off since yesterday and the cron job hasn't complained yet.

So of course 30 minutes after I posted that response, the cron job started producing the error--and then I was able to reproduce it regularly myself as well. So I've gone ahead with the update to the e.39.1 kernel you provided, and we'll see if we get the ENOBUFS error again with that kernel. As I said, though, it'll be tough to give a definitive positive answer until we see the system run for a reasonably long time (weeks) without producing the error.

John, if you do start seeing ENOBUFS error messages, please collect as many AltSysrq-M outputs as you can around that time. If you can't do that, let me know and I'll instrument a special kernel that does it automatically when the kernel memory allocation failure occurs. Larry
We hit the ENOBUFS error today (with the e.39.1 kernel). Here's the AltSysrq-M output, from about two minutes after I received the cron job output indicating the ENOBUFS error:

SysRq : Show Memory
Mem-info:
Free pages:      126272kB ( 78844kB HighMem)
( Active: 290921, inactive_dirty: 75731, inactive_clean: 42081, free: 31568 (638 1276 1914) )
483*4kB 119*8kB 38*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3524kB
active: 456, inactive_dirty: 444, inactive_clean: 0, free: 881 (128 256 384)
9838*4kB 549*8kB 10*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43904kB
active: 114401, inactive_dirty: 39013, inactive_clean: 0, free: 10976 (255 510 765)
14335*4kB 2154*8kB 145*16kB 19*32kB 7*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 78844kB
active: 176064, inactive_dirty: 36274, inactive_clean: 42081, free: 19711 (255 510 765)
Swap cache: add 7620784, delete 7412937, find 7084473/7188442
Page cache size: 363792
Buffer mem: 44941
Ramdisk pages: 0
Free swap:       2598996kB
522224 pages of RAM
292848 pages of HIGHMEM
9532 reserved pages
141539 pages shared
207849 pages swap cached
34 pages in page table cache
41773 pages in slab cache
Buffer memory:   179764kB
CLEAN: 45159 buffers, 180168 kbyte, 140 used (last=45044), 0 locked, 0 protected, 0 dirty
DIRTY: 195 buffers, 780 kbyte, 195 used (last=195), 0 locked, 0 protected, 186 dirty

Now that it's happened once, it'll probably start happening fairly regularly. I just ran the rsync cron job manually and it again encountered the ENOBUFS error. Here's AltSysrq-M output from immediately after that error:

SysRq : Show Memory
Mem-info:
Free pages:      147708kB (141560kB HighMem)
( Active: 267271, inactive_dirty: 77154, inactive_clean: 19024, free: 36927 (638 1276 1914) )
381*4kB 114*8kB 45*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3156kB
active: 897, inactive_dirty: 11, inactive_clean: 0, free: 789 (128 256 384)
254*4kB 223*8kB 12*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2992kB
active: 100844, inactive_dirty: 47161, inactive_clean: 0, free: 748 (255 510 765)
21358*4kB 6262*8kB 251*16kB 21*32kB 7*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 141560kB
active: 165530, inactive_dirty: 29982, inactive_clean: 19024, free: 35390 (255 510 765)
Swap cache: add 7626172, delete 7447138, find 7098345/7203352
Page cache size: 315479
Buffer mem: 47970
Ramdisk pages: 0
Free swap:       2736188kB
522224 pages of RAM
292848 pages of HIGHMEM
9532 reserved pages
134025 pages shared
179036 pages swap cached
42 pages in page table cache
51258 pages in slab cache
Buffer memory:   191880kB
CLEAN: 48266 buffers, 192596 kbyte, 152 used (last=48261), 0 locked, 0 protected, 0 dirty
LOCKED: 53 buffers, 212 kbyte, 53 used (last=53), 0 locked, 0 protected, 0 dirty
DIRTY: 112 buffers, 448 kbyte, 112 used (last=112), 0 locked, 0 protected, 110 dirty

Our production database server also produced an ENOBUFS error today while running the e.39.1 test kernel. The error was produced when another system ran a command like "ssh <dbhost> cat <file>", and the error was actually from cat, not ssh: "/bin/cat: write error: No buffer space available". (Sorry, no AltSysrq-M output for that one since I wasn't around when it happened.)

Created attachment 98842 [details]
AF_UNIX test case
The ENOBUFS appears to be on a unix domain socket, not a tcp socket. Please
try this test case on affected systems. When given an argument above about
32k, it fails on my affected systems. It passes on unaffected systems.
sh -c './mg 65536; echo $?'
Zero is passing; anything else is a bitmapped error code. See the source for
details.
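The attachment source isn't reproduced here, but a minimal stand-in for this kind of test might look like the sketch below--a hypothetical reconstruction, not the actual mg program (whose real source and exit-code bitmap are in attachment 98842). It pushes one buffer of the requested size down an AF_UNIX stream socketpair and reports whether the kernel could supply the backing buffer.

/* Hypothetical stand-in for the attached "mg" test (attachment 98842
 * has the real source).  Exit codes here are illustrative only. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
	size_t size = (argc > 1) ? strtoul(argv[1], NULL, 0) : 65536;
	int sv[2];
	char *buf;
	ssize_t n;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return 1;			/* bit 0: socketpair failed */
	if ((buf = calloc(1, size)) == NULL)
		return 2;			/* bit 1: allocation failed */

	/* MSG_DONTWAIT: an unallocatable send buffer errors out instead
	 * of blocking.  ENOBUFS here means the kernel could not find
	 * enough contiguous memory for the skb(s) backing this write. */
	n = send(sv[0], buf, size, MSG_DONTWAIT);
	if (n < 0) {
		fprintf(stderr, "send(%zu): %s\n", size, strerror(errno));
		return (errno == ENOBUFS) ? 4 : 8;  /* bits 2/3: send failed */
	}
	printf("wrote %zd of %zu bytes\n", n, size);
	return 0;
}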
Michael: your program returns zero on systems of ours which regularly produce the ENOBUFS error, no matter what size I use (65536, 131072, 262144, etc.). Because of the structure of ssh now, it's tricky to follow the sequence of descriptor manipulations (through the pipe, fork/exec, close, dup, etc. calls) to determine whether or not the particular descriptor in question is associated with a TCP socket. There's no doubt that it happens on a data write, though (the contents of the failing write are contents from the copied file). I'm trying to capture a sufficiently terse strace that exhibits the error, but it appears that the new kernel has enough of an effect that it makes the problem harder to reproduce consistently--but without actually resolving it.

Larry: if you're waiting for more info from us, please say so...otherwise I'm assuming the AltSysrq-M output I provided above is enough.

Ok, I've been trying to reproduce this issue, but it's been really difficult to reproduce artificially as indicated. I have a good sense of what the problem is, and I think that we should try: 'echo 4096 > /proc/sys/net/core/wmem_default'. I think this might resolve the issue, and then we can perhaps isolate the change to affect just unix sockets, although I don't think this value has as wide-reaching effects on the network layer as might be expected. Any chance we can test this change?

So here's what I was thinking of for a patch:

--- linux/net/unix/af_unix.c.bak	Wed Mar 31 13:35:42 2004
+++ linux/net/unix/af_unix.c	Wed Mar 31 14:27:36 2004
@@ -1344,6 +1344,11 @@ static int unix_stream_sendmsg(struct so
 	 */
 	skb=sock_alloc_send_skb(sk,size,msg->msg_flags&MSG_DONTWAIT, &err);
+
+	while((err == -ENOBUFS) && (size > 4096)) {
+		size >>= 2;
+		skb = sock_alloc_send_skb(sk,size,msg->msg_flags&MSG_DONTWAIT, &err);
+	}
 	if (skb==NULL)
 		goto out_err;

As long as we are getting -ENOBUFS and the size is greater than a page, retry the allocation asking for half as much memory.

Updated patch:

--- linux/net/unix/af_unix.c.bak	Wed Mar 31 13:35:42 2004
+++ linux/net/unix/af_unix.c	Wed Mar 31 14:35:18 2004
@@ -1344,6 +1344,11 @@ static int unix_stream_sendmsg(struct so
 	 */
 	skb=sock_alloc_send_skb(sk,size,msg->msg_flags&MSG_DONTWAIT, &err);
+
+	while((err == -ENOBUFS) && (size > PAGE_SIZE)) {
+		size >>= 1;
+		skb = sock_alloc_send_skb(sk,size,msg->msg_flags&MSG_DONTWAIT, &err);
+	}
 	if (skb==NULL)
 		goto out_err;
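For context on why wmem_default governs a unix-socket write at all: in the 2.4 unix_stream_sendmsg(), each chunk of a stream write is capped at roughly half of sk->sndbuf before the skb is allocated, and sndbuf is seeded from wmem_default when the socket is created. The fragment below is paraphrased from memory of the 2.4 source, so the exact constants are approximate:

/* Paraphrase of the chunking logic in 2.4 net/unix/af_unix.c,
 * unix_stream_sendmsg(); not a verbatim quote. */
size = len - sent;

/* Keep two messages in the pipe so it schedules better. */
if (size > sk->sndbuf / 2 - 64)
	size = sk->sndbuf / 2 - 64;

/* One contiguous kernel allocation of 'size' bytes: ~32kB chunks at
 * the default wmem_default of 65535, ~128kB at 262143, a single page
 * at 4096 -- which is why the sysctl above changes the failure mode. */
skb = sock_alloc_send_skb(sk, size, msg->msg_flags & MSG_DONTWAIT, &err);

That would also line up with Michael's observation that his test starts failing above about 32k on systems still at the default wmem_default of 65535.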
Ok, I've set wmem_default to 4096 as requested (on our machine that currently has an rsync job failing with ENOBUFS about 1-2 times every few days). We'll see what happens. Does the fact that Michael's mg program worked with all buffer sizes we tried mean that it's unlikely this patch would fix things, though? I suppose not, since it's possible that at the time I was running mg the system wasn't in that magic state that causes the ENOBUFS to occur.... With the patch, what happens if an allocation attempt for PAGE_SIZE bytes fails? And mightn't the caller be adversely affected by getting back a smaller buffer than expected? (Mainly rhetorical questions--I'm sure you're thinking about those things.)

On reproducing the problem: it seemed to be a good bit easier to reproduce under e.38--I had several instances in which I'd narrowed it down to an scp of a single not-ridiculously-large (<300MB) file, reproducible at will. So if you want to make it easier to reproduce, you might try falling back to e.38. I may do that on our system as well, unless there's a good reason not to (since we already know that e.39.1 doesn't fix the problem outright).

The fact that the program worked with all buffer sizes doesn't, I think, make this patch less likely to work. The theory is that we can't get a large contiguous piece of memory because memory is fragmented in a unique way. Reproducing this unique state of memory is difficult because the system is so dynamic in allocating and freeing pages. If the attempt for PAGE_SIZE fails, we end up with -ENOBUFS again; however, it should be impossible to get the system into a state where not even one page is available. In terms of the smaller buffer, this just means we are passing things through the network layer in smaller units: we still pass the same amount of data, but we break it up into smaller chunks. This could adversely affect performance, which is why I did the 'while' loop that first tries for the requested size and then falls back to smaller and smaller sizes. I think the /proc setting will give us a good indication of whether or not this patch is the right idea for addressing this issue. Thanks for doing that, and please let us know if you see any noticeable performance changes with this change. Thanks.
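To make the fragmentation theory concrete: the buddy allocator rounds each request up to a power-of-two number of 4kB pages (its "order"), and the per-order counts in the SysRq-M listings above show exactly what is available. A small illustration of the arithmetic (mine, not from the report):

#include <stdio.h>

/* Order of the smallest power-of-two block (in 4kB pages) that can
 * hold a request -- the granularity the 2.4 buddy allocator works in. */
static int alloc_order(unsigned long bytes)
{
	int order = 0;
	while ((4096UL << order) < bytes)
		order++;
	return order;
}

int main(void)
{
	/* A ~32kB chunk (half of the default wmem_default of 65535) needs
	 * order 3, i.e. 32kB contiguous (skb overhead can push it to
	 * order 4).  The Normal-zone listing in the first dump above
	 * ("9838*4kB 549*8kB 10*16kB 0*32kB ...") shows ~43MB free but
	 * nothing larger than 16kB, so that request can only fail ->
	 * ENOBUFS, despite plenty of free memory overall. */
	printf("order for 32768 bytes:  %d\n", alloc_order(32768));  /* 3 */
	printf("order for 131072 bytes: %d\n", alloc_order(131072)); /* 5 */
	printf("order for 4096 bytes:   %d\n", alloc_order(4096));   /* 0 */
	return 0;
}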
Ok, I've got some very useful feedback for you. We have an rsync cron job that runs frequently on the affected production server but only rsyncs a relatively small amount of data, and it just started failing predictably (finally!). On this system, wmem_default is set to 262143 as per many recommendations for NFS clients like this one, and the rsync job in question rsyncs files from a local filesystem to an NFS-mounted filesystem (i.e., SSH is not involved). I ran the script in question several times with wmem_default set to 262143 and it failed each time, but then I changed wmem_default to 4096 and immediately ran it again and it succeeded--so the workaround does appear to have the effect you thought it would. Also, FYI, when I then changed wmem_default back to 262143, the script again succeeded. I have several straces of the failing rsyncs--if you need them, let me know (I won't just post them otherwise because they contain some sensitive data, and it would take time to clean them up before posting them).

Ok, thanks for the information. So it looks like we are hopefully getting close... I've got a kernel built with the patch above, although I have to admit it hasn't seen much testing yet. The patch should obey the large setting for wmem_default unless it can't get the memory segment requested, so I think that performance should be the same as before. The test kernel is below; it also contains some additional fixes, most notably the patches discussed in 117902. http://people.redhat.com/~jbaron/.private/testing/2.4.9-e.39.9.test/

I've installed the e.39.9.test kernel on our server where the ENOBUFS errors occur frequently, and I'll reboot it as soon as possible (and reset wmem_default to the previous setting) to see if it fixes the problem. FYI, we've not had any more ENOBUFS errors on this system with wmem_default set to 4096.

Thanks! Looking forward to hearing feedback.

I don't want to sound pushy, but any updates on this one? This is a very critical bug, and we're looking for as much feedback on it as possible. Thanks.

No ENOBUFS errors so far on the system that's running 2.4.9-e.39.9.testsmp, since its reboot on Friday. I hadn't posted any feedback about it yet because this system didn't always produce ENOBUFS errors within 4 days in the past, so there's still a chance that the bug will occur in the future. Based on past behavior, I'd say that a week of error-free behavior should be enough for a fairly high degree of confidence that it's fixed. (I do expect that the patch will successfully work around the bug, though, based on the wmem_default testing.)

Thanks for the confidence, and the update. How does the system look now that it has presumably been running the updated kernel for more than a week? Any -ENOBUFS?

We've got good news and bad news. The good news is that there are no more ENOBUFS errors--which is as expected, since the code you've added would seem to make them almost impossible. But the bad news is that the removal of the ENOBUFS errors appears to have resulted in another, potentially more serious problem. The cron job in question rsyncs files from a local disk to an NFS-mounted directory. With e.39.9.testsmp, NFS is now hanging intermittently while attempting to read certain files--the rsync will sit there holding the file open (on the destination NFS volume) for hours, looping on a select() call, until it's finally killed. This is causing errors in succeeding cron jobs, as they attempt to delete the opened files (actually the first deletion works and results in a .nfs file, but the later attempted deletion of the .nfs file is what produces the error). We've been running this cron job for several years, and this has never happened before. The fact that it only happened now with the e.39.9.testsmp kernel, and the fact that it started occurring in the uptime cycle right about when we would previously have started seeing ENOBUFS errors, would seem to indicate that it's a direct result of the workaround. Apparently NFS doesn't always deal well with receiving buffers that are smaller than the ones it requested. This is potentially a more serious bug for us than the original one, since we rely on NFS for critical production functions. It's bad to have rsync/ssh die with ENOBUFS errors, but to have NFS suddenly hang while reading critical data would be worse. Let me know if you need any other info. I've collected data from two separate incidents so far, but I'm not sure how useful it would be to you. In particular, I've got AltSysrq-M output, but it was taken at the point when I received the cron job errors, not the point at which the rsync NFS read actually hung--so it may not be that useful. The problem is occurring about 1-2 times per day.
Hmmm--NFS doesn't use unix domain sockets for transport (they are only for local sockets), so I'm not sure exactly what's going on here, but I certainly agree it's related to the patch. Perhaps it's not quite complete. I'll re-review it tomorrow. Also, what was the setting of wmem_default during this time?

All kernel tunables related to networking are at defaults on this system, so wmem_default is set to 65535. What you're saying implies that wmem_default only affects unix domain sockets, correct? I'm surprised if that's the case, since the NFS recommendations I've seen frequently discuss changing wmem_default to improve NFS-over-UDP performance (e.g. http://www.netapp.com/tech_library/3183.html). In any case, I don't know if the new problem is related to NFS or is happening within rsync, so that was just speculation on my part. If it helps, the file on which rsync hangs is always on the destination, not the source (and the destination in this case is NFS-mounted).

wmem_default is supposed to be a system-wide setting, so in theory it could affect UDP buffer sizes and hence NFS performance. However, the patch that I did is limited to unix sockets, so I would have expected the patch to have a much more limited effect than changing wmem_default. So this change is likely affecting how rsync communicates via unix sockets, and is thus only "apparently" caused by NFS.
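The "system-wide" part is visible in the generic socket setup code: every new socket, whatever its family, starts with buffer limits seeded from the core sysctls. Paraphrased from memory of 2.4 net/core/sock.c (sock_init_data()), not a verbatim quote:

/* Paraphrase of 2.4 net/core/sock.c, sock_init_data(). */
sk->sndbuf = sysctl_wmem_default;   /* /proc/sys/net/core/wmem_default */
sk->rcvbuf = sysctl_rmem_default;   /* /proc/sys/net/core/rmem_default */

So tuning wmem_default up for NFS-over-UDP throughput also raises the per-chunk skb size for every unix stream socket, which is how an NFS tuning recommendation ended up aggravating this bug.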
Ok, got it--I didn't realize your patch affected unix domain sockets only. In looking at the lsof/strace output for the rsync processes that are holding open the files on the destination, it does look like the select() call is hanging while trying to read a unix domain socket (and not the destination file). Here's one example:

COMMAND   PID USER  FD   TYPE     DEVICE    SIZE     NODE NAME
rsync   27504 root  cwd   DIR       0,12    4096   251498 /path/to/file (server:/mount)
rsync   27504 root  rtd   DIR        8,3    4096        2 /
rsync   27504 root  txt   REG        8,5  202212    33250 /usr/bin/rsync
rsync   27504 root  mem   REG        8,3  495474   128034 /lib/ld-2.2.4.so
rsync   27504 root  mem   REG        8,5   29500    64285 /usr/lib/libpopt.so.0.0.0
rsync   27504 root  mem   REG        8,3  263059   128104 /lib/libresolv-2.2.4.so
rsync   27504 root  mem   REG        8,3 5797952    16316 /lib/i686/libc-2.2.4.so
rsync   27504 root  mem   REG        8,3  262396   128096 /lib/libnss_files-2.2.4.so
rsync   27504 root    0u unix 0xeeacc040         10595151 socket
rsync   27504 root    1r  REG       0,12 3232829  2944263 /path/to/file/.nfs007abc0400000020 (server:/mount)
rsync   27504 root    2w FIFO        0,0         10595150 pipe
rsync   27504 root    4u unix 0xe5202580         10598232 socket
rsync   27504 root    6u unix 0xe4f3e580         10598234 socket

# strace -fp 27504
Process 27504 attached - interrupt to quit
select(1, [0], NULL, NULL, {54, 30000}) = 0 (Timeout)
select(1, [0], NULL, NULL, {60, 0})     = 0 (Timeout)
select(1, [0], NULL, NULL, {60, 0} <unfinished ...>

John, any chance you have more of the strace of the rsync process? For example, what rsync was doing before hanging in select. Thanks.

Unfortunately no--that strace was taken well after the process had hung. It would be difficult to get an strace of the actual process as it's hanging, since it's a cron job that's rsync'ing a large volume of data (>15GB) and it runs every hour. And strace'ing it generates so much data that it may actually affect the behavior of the bug itself. If you need that to go forward, though, let me know and I'll see if it's practicable to strace each instance of the cron job (although then we'll have the problem of uploading the strace, which I believe was several hundred MB when I tried it once before...).

Unfortunately, without more context in the rsync strace, there's no way to tell whether it's a kernel problem (related to the patch) or a userspace problem (a bug in rsync for me to look into). Fortunately, you won't need to upload the entire strace--the last few hundred lines of it should do. Also, what version of rsync are you running?

We're running rsync 2.5.7 (from the rsync-2.5.7-0.7 RPM, which is the most current one on RHN). The ENOBUFS errors have occurred since well back into rsync 2.4.x. Just as a reminder that the original underlying issue isn't only with rsync, please note comment 10 above; we also occasionally saw the ENOBUFS errors with scp.

I've added strace'ing to the rsync cron job, so now we'll just have to wait for the error to occur again.

Ok, the rsync cron job running under strace finally started failing again. I'm attaching a section of the end of the strace which I believe encompasses as much as is useful from that section; it starts with a sequence of lstat64 calls (of which there were thousands, in sequence--I've included just the last few), and continues through the point at which the rsync hangs. I've changed names to protect the innocent. If you need more, there's plenty--the full output is 263MB / 4 million lines.

Created attachment 99724 [details]
strace excerpt of hanging rsync cron job (under kernel e.39.9.testsmp)
I've been looking at the strace output, and it looks like the sending rsync writes 20106 bytes of file data to the socket, then does a select() for writability on the socket. The receiving rsync reads and processes all the data from the socket (data from a previous file and then the 20106 bytes of file data). The receiver then does a select for readability on the socket. The sender's select times out, so it tries again. The retry times out, so it tries again... So the kernel is never telling the sender that it can write to the socket, even after the receiver has read all the data. I can try writing a quick test program that attempts to replicate rsync's behavior here, if you need final proof, but it's clear to me the problem isn't in rsync itself.
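That test program never materialized in the report, but a minimal sketch of the pattern described above might look like the following. This is hypothetical, not the actual test: on a correct kernel the final select() returns immediately, and the hang would only reproduce on the e.39.9 test kernel once its ENOBUFS retry path has actually fired.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>

int main(void)
{
	int sv[2];
	char buf[16384];
	fd_set wfds;
	struct timeval tv = { 60, 0 };
	ssize_t n;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
		perror("socketpair");
		return 1;
	}
	memset(buf, 'x', sizeof(buf));
	fcntl(sv[0], F_SETFL, O_NONBLOCK);
	fcntl(sv[1], F_SETFL, O_NONBLOCK);

	/* "Sender": stuff the socket until its send buffer fills up. */
	while ((n = send(sv[0], buf, sizeof(buf), 0)) > 0)
		;

	/* "Receiver": drain everything that was queued. */
	while (recv(sv[1], buf, sizeof(buf), 0) > 0)
		;

	/* Sender again: after the drain, select() should report the socket
	 * writable.  A 60-second timeout here is exactly the hang in the
	 * strace: the kernel never re-signals writability (the accounting
	 * bug diagnosed below). */
	FD_ZERO(&wfds);
	FD_SET(sv[0], &wfds);
	n = select(sv[0] + 1, NULL, &wfds, NULL, &tv);
	printf(n > 0 ? "writable again: OK\n" : "select timed out: hang reproduced\n");
	return n > 0 ? 0 : 2;
}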
Yes--I've looked into this a bit more, and the way the kernel does reference counting for the case where a buffer is broken up is incorrect and would result in this type of hang. I'll post a patch and a test kernel shortly.

Ok, here is the corrected patch. I should have been resetting err to 0, because the -ENOBUFS would stick at its original value: after a successful retry the loop kept running, and the extra buffers it allocated apparently meant the socket's wmem_alloc accounting never dropped, so we had no room in the buffer to write. Below should be the correct patch:

--- linux/net/unix/af_unix.c.bak	Wed Mar 31 13:35:42 2004
+++ linux/net/unix/af_unix.c	Wed Mar 31 14:35:18 2004
@@ -1344,6 +1344,12 @@ static int unix_stream_sendmsg(struct so
 	 */
 	skb=sock_alloc_send_skb(sk,size,msg->msg_flags&MSG_DONTWAIT, &err);
+
+	while((err == -ENOBUFS) && (size > PAGE_SIZE)) {
+		size >>= 1;
+		err = 0;
+		skb = sock_alloc_send_skb(sk,size,msg->msg_flags&MSG_DONTWAIT, &err);
+	}
 	if (skb==NULL)
 		goto out_err;

Ok, here's a kernel with the corrected patch: http://people.redhat.com/~jbaron/.private/testing/2.4.9-e.40.4.test/

Installed (as of last night). Now we'll just have to play the waiting game again to see if nothing happens.

Any updates?

The rsync cron job has been behaving, and we haven't seen any other odd behavior. As before, I was waiting for a week of error-free behavior to say that it appears to be fixed...and we just reached that today. So as far as I can tell, this patch successfully works around the ENOBUFS issue without introducing any other issues. Thanks for all the work. What would be the schedule for rolling this out in a new kernel release? And is the plan to combine this fix with the kswapd fix from bug 117902?

John, we did combine this with the kswapd fix; can you test the latest kernel in: http://people.redhat.com/~jbaron/.private/testing/2.4.9-e.40.6.test/ Larry Woodman

BTW, we improved the fix for bug 117902 to more aggressively reclaim buffermem pages when the system is in a lowmem shortage situation. We are looking for as much testing on the e.40.6 kernel as we can get over the next few weeks. Thanks, Larry Woodman

Yes and no. I can install that on our development server (where the rsync cron job has been failing) tonight, but not on our production database server (where we've seen both the kswapd issues and very frequent ENOBUFS errors)--mainly because of the name. If you could regenerate it so that it has a standard kernel name--no "test"--I can try installing it on the production server this weekend as well. The production server runs vendor software that's sensitive to kernel naming. If that's a problem, I can just install on the development server to make sure it doesn't cause any obvious problems. The production server has been running the kswapd-patched kernel (2.4.9-e.39.1smp) for some time now and hasn't had any more kswapd problems, so we're relatively sure that that fix is ok.

We like putting .test in to mark it as such. Is there any other name you might suggest that would play nice on the production server?

And that's good practice. But the e.39.1 test kernel didn't have "test" (or anything else) in the name--which is why I could install it on production. I can't justify taking any risk whatsoever of this test affecting production, so if you can't provide another version with a standard kernel name I won't be able to install it on that system. Normally it wouldn't be a problem, but as I say we have (critical) vendor software that has a non-zero chance of being affected by this. I have installed the e.40.6.test kernel on our development server, though, so we'll see how that goes.

The latest kernel is e.40.8, which again has the fix for this issue and the one for 117902; I also took the ".test" out of the name so that it can get wider testing. Any feedback is much appreciated. Thanks. http://people.redhat.com/~jbaron/.private/testing/2.4.9-e.40.8/

I've installed e.40.8 on our development server, and we'll be enabling it on our production server this weekend. It appears to be working fine--no more ENOBUFS errors, and no other apparent weirdness.

Hi John, just wanted to make sure you had the latest U5 beta kernel available, if you are interested: http://people.redhat.com/~jbaron/.private/u5/2.4.9-e.40.10/ Thanks, -Jason

Thanks for mentioning it. Unless there's a compelling reason to install this new one, though, we're going to stay with the e.40.8smp kernel; we've already invested a lot of time into testing this issue, and it's painful to upgrade the kernel on our production servers because of downtime window requirements. The e.40.8smp kernel appears to be addressing our existing issues--no more ENOBUFS errors. (Actually there are some questions about the kswapd fix now, but I'll add those to bug 117902.)

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2004-437.html