Bug 117460

Summary: SSH fails in various contexts with ENOBUFS error
Product: Red Hat Enterprise Linux 2.1 Reporter: John Caruso <jcaruso>
Component: kernelAssignee: Jason Baron <jbaron>
Status: CLOSED ERRATA QA Contact:
Severity: high Docs Contact:
Priority: medium    
Version: 2.1CC: alan, fenlason, knoel, mgalgoci, riel
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
URL: http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=75108
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2004-08-18 14:25:08 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 116726    
Attachments:
Description Flags
AF_UNIX test case
none
strace excerpt of hanging rsync cron job (under kernel e.39.9.testsmp) none

Description John Caruso 2004-03-04 08:57:06 UTC
This is actually an attempt to reopen bug 75108.  Please refer to 
that bug for relevant background--we're seeing all the same problems 
that existed in that case.

The main question is: why was that case closed?  Even if the problem 
was resolved in a later kernel, RedHat 2.1AS is a supported release 
with many enterprise customers using it--and it's based on the 2.4.9 
kernel.  William Ewing's 2.4.9-34enterprise kernel was 
not "WAAAAAAAAAAAY too old"...it was probably the current release of 
2.1AS at that time.  The current 2.1AS kernel release is 2.4.9-38.

We are seeing this problem as well, in spades.  Basically, any 
machine that comes close to utilizing all of its RAM seems to 
encounter this error; the symptoms are rsync commands (using ssh) 
that fail with ENOBUFS, ssh commands that stall or fail, and NTP 
instability (not clearly related, but it does seem so in our 
environment).

Since 2.1AS 1) is still a supported RedHat release, 2) will be for 
some time, and 3) is specifically being used by enterprise customers 
(like us) in mission-critical contexts, we would most definitely like 
to see whatever fix exists in a later kernel backported to 2.1AS.

[ Note: I'm marking this as severity "high" because this type of 
memory instability on a production system--e.g., an Oracle database 
server backing up a production site--is a major concern for 
enterprise customers. ]

Comment 1 Arjan van de Ven 2004-03-04 09:05:34 UTC
2.4.9-34enterprise is not an AS2.1 kernel, AS2.1 uses a different
versioning scheme and independent numbers, 2.4.9-e.38 is well over 2
years newer than 2.4.9-34enterprise.


Comment 2 John Caruso 2004-03-04 09:22:36 UTC
Ok, thanks for the clarification on that point--I was misled by the 
fact that "34" was about the correct release level at that time for 
RHAS2.1, and the word "enterprise" in the kernel name.  Regardless of 
that, we are in fact seeing all of the same symptoms as those in bug 
75108, but on a system running the latest AS2.1 errata kernel release 
(kernel-smp-2.4.9-e.38).  It seems likely that this is the same bug 
manifesting itself on a similar kernel release.

If the knowledge that the symptoms are the same as in bug 75108 and 
the apparent fact that this is a known bug (with a known resolution) 
are not sufficient to address this bug on AS2.1, I can try to capture 
whatever information (strace, tcpdump, etc) you may need to 
troubleshoot it.


Comment 3 John Caruso 2004-03-15 20:14:56 UTC
What's up with this bug?  It appears that the exchange above led to 
it falling through the cracks.  It's a live bug.

Larry, assuming this really is your bug at this point: what you said 
about buffermem in bug 117902 would also seem to apply to this bug 
(I'd always assumed they were related anyway, since both bugs show up 
at the same time when a system is low on memory).  Since Arjan 
claimed that the ENOBUFS/SSH bug was fixed in a later version of the 
kernel, it seems likely that that that fix could just be backported 
to RHAS 2.1...and it also seems likely that the fix is the same as 
the fix for bug 117902 (or is closely related).


Comment 4 Larry Woodman 2004-03-16 19:55:50 UTC
John, can you try out the same kernel that I posted in Bug 117902 and
see if it fixes this problem as well?  It appears that the ENOBUFS
error is also a side effect of running out of lowmem.

Larry


Comment 5 John Caruso 2004-03-18 04:48:38 UTC
We have a system that has gotten the ENOBUFS error fairly regularly 
in the past with an hourly rsync cron job--so much so that I'd 
filtered that output so I wouldn't see it.  I've removed that filter 
and I'm now waiting for the system to give me the error (which 
manifests itself as "writefd_unbuffered failed to write X bytes" with 
rsync), and once it does I'll try updating to your new kernel and see 
if that fixes it.

Unfortunately it may be tough to give a definite positive answer, 
because the problem itself doesn't happen predictably--I've had the 
filter off since yesterday and the cron job hasn't complained yet.


Comment 6 John Caruso 2004-03-18 07:11:35 UTC
So of course 30 minutes after I posted that response, the cron job 
started producing the error--and then I was able to reproduce it 
regularly myself as well.  So I've gone ahead with the update to the 
e.39.1 kernel you provided, and we'll see if we get the ENOBUFS error 
again with that kernel.  As I said, though, it'll be tough to give a 
definitive positive answer until we see the system run for a 
reasonably long time (weeks) without producing the error.


Comment 7 Larry Woodman 2004-03-18 14:45:31 UTC
John, if you do start seeing ENOBUFS error messages please collect
as mant AltSysrq-M outputs as you can around that time.  If you
can not do that let me know and I'll instrument a special kernel that
does it automatically when the kernel memory allocation failure occurs.

Larry


Comment 8 John Caruso 2004-03-23 20:04:11 UTC
We hit the ENOBUFS error today (with the e.39.1 kernel).  Here's the 
AltSysrq-M output, from about two minutes after I received the cron 
job output indicating the ENOBUFS error:

SysRq : Show Memory
Mem-info:
Free pages:      126272kB ( 78844kB HighMem)
( Active: 290921, inactive_dirty: 75731, inactive_clean: 42081, free: 
31568 (638 1276 1914) )
483*4kB 119*8kB 38*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 
0*1024kB 0*2048kB 0*4096kB = 3524kB
  active: 456, inactive_dirty: 444, inactive_clean: 0, free: 881 (128 
256 384)
9838*4kB 549*8kB 10*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 
0*1024kB 0*2048kB 0*4096kB = 43904kB
  active: 114401, inactive_dirty: 39013, inactive_clean: 0, free: 
10976 (255 510 765)
14335*4kB 2154*8kB 145*16kB 19*32kB 7*64kB 1*128kB 1*256kB 1*512kB 
0*1024kB 0*2048kB 0*4096kB = 78844kB
  active: 176064, inactive_dirty: 36274, inactive_clean: 42081, free: 
19711 (255 510 765)
Swap cache: add 7620784, delete 7412937, find 7084473/7188442
Page cache size: 363792
Buffer mem: 44941
Ramdisk pages: 0
Free swap:       2598996kB
522224 pages of RAM
292848 pages of HIGHMEM
9532 reserved pages
141539 pages shared
207849 pages swap cached
34 pages in page table cache
41773 pages in slab cache
Buffer memory:   179764kB
    CLEAN: 45159 buffers, 180168 kbyte, 140 used (last=45044), 0 
locked, 0 protected, 0 dirty
    DIRTY: 195 buffers, 780 kbyte, 195 used (last=195), 0 locked, 0 
protected, 186 dirty

Now that it's happened once, it'll probably start happening fairly 
regularly.


Comment 9 John Caruso 2004-03-23 20:39:07 UTC
I just ran the rsync cron job manually and it again encountered the 
ENOBUFS error.  Here's AltSysrq-M output from immediately after that 
error:

SysRq : Show Memory
Mem-info:
Free pages:      147708kB (141560kB HighMem)
( Active: 267271, inactive_dirty: 77154, inactive_clean: 19024, free: 
36927 (638 1276 1914) )
381*4kB 114*8kB 45*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 
0*1024kB 0*2048kB 0*4096kB = 3156kB
  active: 897, inactive_dirty: 11, inactive_clean: 0, free: 789 (128 
256 384)
254*4kB 223*8kB 12*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 
0*1024kB 0*2048kB 0*4096kB = 2992kB
  active: 100844, inactive_dirty: 47161, inactive_clean: 0, free: 748 
(255 510 765)
21358*4kB 6262*8kB 251*16kB 21*32kB 7*64kB 1*128kB 1*256kB 1*512kB 
0*1024kB 0*2048kB 0*4096kB = 141560kB
  active: 165530, inactive_dirty: 29982, inactive_clean: 19024, free: 
35390 (255 510 765)
Swap cache: add 7626172, delete 7447138, find 7098345/7203352
Page cache size: 315479
Buffer mem: 47970
Ramdisk pages: 0
Free swap:       2736188kB
522224 pages of RAM
292848 pages of HIGHMEM
9532 reserved pages
134025 pages shared
179036 pages swap cached
42 pages in page table cache
51258 pages in slab cache
Buffer memory:   191880kB
    CLEAN: 48266 buffers, 192596 kbyte, 152 used (last=48261), 0 
locked, 0 protected, 0 dirty
   LOCKED: 53 buffers, 212 kbyte, 53 used (last=53), 0 locked, 0 
protected, 0 dirty
    DIRTY: 112 buffers, 448 kbyte, 112 used (last=112), 0 locked, 0 
protected, 110 dirty


Comment 10 John Caruso 2004-03-24 18:51:30 UTC
Our production database server also produced an ENOBUFS error today 
while running the e.39.1 test kernel.  The error was produced when 
another system ran a command like "ssh <dbhost> cat <file>", and the 
error was actually from cat, not ssh: "/bin/cat: write error: No 
buffer space available".  (Sorry, no AltSysrq-M output for that one 
since I wasn't around when it happened.)


Comment 11 Michael Glasgow 2004-03-25 07:49:30 UTC
Created attachment 98842 [details]
AF_UNIX test case

The ENOBUFS appears to be on a unix domain socket, not a tcp socket.  Please
try this test case on affected systems.  When given an argument above about
32k, it fails on my affected systems.  It passes on unaffected systems.

sh -c './mg 65536; echo $?'

Zero is passing; anything else is a bitmapped error code.  See the source for
details.

Comment 12 John Caruso 2004-03-30 21:36:28 UTC
Michael: your program returns zero on systems of ours which regularly 
produce the ENOBUFS error, no matter what size I use (65536, 131072, 
262144, etc).

Because of the structure of ssh now, it's tricky to follow the 
sequence of descriptor manipulations (through the pipe, fork/exec, 
close, dup, etc calls) to determine whether or the particular 
descriptor in question is associated with a TCP socket.  There's no 
doubt that it happens on a data write, though (the contents of the 
failing write are contents from the copied file).  I'm trying to 
capture a sufficiently terse strace that exhibits the error, but it 
appears that the new kernel has enough of an effect that it makes the 
problem harder to reproduce consistently--but without actually 
resolving it.

Larry: if you're waiting for more info from us, please say 
so...otherwise I'm assuming the AltSysrq-M output I provided above is 
enough.


Comment 13 Jason Baron 2004-03-31 16:21:43 UTC
ok, i've been trying to reproduce this issue, but its been really
difficult to artificially re-produce as indicated. I have a good sense
as to what the problem is and i think that we should try: 'echo 4096 >
 /proc/sys/net/core/wmem_default'. I think this might resolve the
issue and then we can isolate the change perhaps to just affect unix
sockets, although i don't think this value has as wide reaching
affects on the network layer as might be expected. Any chance we can
test this change?

Comment 14 Jason Baron 2004-03-31 17:32:11 UTC
ok, i've been trying to reproduce this issue, but its been really
difficult to artificially re-produce as indicated. I have a good sense
as to what the problem is and i think that we should try: 'echo 4096 >
 /proc/sys/net/core/wmem_default'. I think this might resolve the
issue and then we can isolate the change perhaps to just affect unix
sockets, although i don't think this value has as wide reaching
affects on the network layer as might be expected. Any chance we can
test this change?

Comment 15 Jason Baron 2004-03-31 19:22:40 UTC
so here's what i was thinking of for a patch:

--- linux/net/unix/af_unix.c.bak	Wed Mar 31 13:35:42 2004
+++ linux/net/unix/af_unix.c	Wed Mar 31 14:27:36 2004
@@ -1344,6 +1344,11 @@ static int unix_stream_sendmsg(struct so
 		 */
 		 
 		skb=sock_alloc_send_skb(sk,size,msg->msg_flags&MSG_DONTWAIT, &err);
+	
+		while((err == -ENOBUFS) && (size > 4096)) {
+			size >>= 2;
+			skb = sock_alloc_send_skb(sk,size,msg->msg_flags&MSG_DONTWAIT, &err); 
+		}
 
 		if (skb==NULL)
 			goto out_err;


As long as we are getting -ENOBUFS and greater than a page size retry
the allocation asking for half as much memory.


Comment 16 Jason Baron 2004-03-31 19:27:47 UTC
updated patch:

--- linux/net/unix/af_unix.c.bak	Wed Mar 31 13:35:42 2004
+++ linux/net/unix/af_unix.c	Wed Mar 31 14:35:18 2004
@@ -1344,6 +1344,11 @@ static int unix_stream_sendmsg(struct so
 		 */
 		 
 		skb=sock_alloc_send_skb(sk,size,msg->msg_flags&MSG_DONTWAIT, &err);
+	
+		while((err == -ENOBUFS) && (size > PAGE_SIZE)) {
+			size >>= 1;
+			skb = sock_alloc_send_skb(sk,size,msg->msg_flags&MSG_DONTWAIT, &err); 
+		}
 
 		if (skb==NULL)
 			goto out_err;


Comment 17 John Caruso 2004-04-01 01:08:26 UTC
Ok, I've set wmem_default to 4096 as requested (on our machine that 
currently has an rsync job failing with ENOBUFS about 1-2 times every 
few days).  We'll see what happens.

Does the fact that Michael's mg program worked with all buffer sizes 
we tried mean that it's unlikely this patch would fix things, 
though?  I suppose not, since it's possible that at the time I was 
running mg the system wasn't in that magic state that causes the 
ENOBUFS to occur....

With the patch, what happens if an allocation attempt for PAGE_SIZE 
bytes fails?  And mightn't the caller be adversely affected by 
getting back a smaller buffer than expected?  (Mainly rhetorical 
questions--I'm sure you're thinking about those things.)

On reproducing the problem: it seemed to be a good bit easier to 
reproduce under e.38--I had several instances in which I'd narrowed 
it down to an scp of a single not-ridiculously-large (<300MB) file, 
reproducible at will.  So if you want to make it easier to reproduce, 
you might try falling back to e.38.  I may do that on our system as 
well, unless there's a good reason not to (since we already know that 
e.39.1 doesn't fix the problem outright).


Comment 18 Jason Baron 2004-04-01 02:30:59 UTC
The fact that the program worked with all buffer sizes, i don't think
makes this patch less likely to work. The theory is that we can't get
a large contiguous piece of memory b/c memory is fragmented in a
unique way. Reproducing this unique state of memory is difficult b/c
the system is so dynamic allocating and freeing pages. 

if the attempt for PAGE_SIZE fails, we end up with -ENOBUFS again.
however, it should be impossible to get the system into a state where
not ven one page is available. In terms of the smaller buffer, this
just means we are passing things through the network layer in a
smaller unit. we are still going to be passing the same amount of
data, but we break it up into smaller chunks. This could adversely
affact performance, and that is why i did the 'while' loop that first
tries for the requested size and then falls back to smaller and
smaller sizes.

I think the /proc thing will give us a good indication as to whether
or not this patch is the right idea in addressing this issue. Thanks
for doing that, and please let us know if you see any noticable
performance changes with this change. thanks.

Comment 19 John Caruso 2004-04-02 21:21:03 UTC
Ok, I've got some very useful feedback for you.  We have an rsync 
cron job that runs frequently on the affected production server but 
only rsyncs a relatively small amount of data, and it just started 
failing predictably (finally!).  On this system, wmem_default is set 
to 262143 as per many recommendations for NFS clients like this one, 
and the rsync job in question rsyncs files from a local filesystem to 
an NFS-mounted filesystem (i.e., SSH is not involved).

I ran the script in question several times with wmem_default set to 
262143 and it failed each time, but then I changed wmem_default to 
4096 and immediately ran it again and it succeeded--so the workaround 
does appear to have the effect you'd thought it would.  Also, FYI, 
when I then changed wmem_default back to 262143, the script again 
succeeded.

I have several straces of the failing rsyncs--if you need them, let 
me know (I won't just post them otherwise because they contain some 
sensitive data, and it would take time to clean them up before 
posting them).


Comment 20 Jason Baron 2004-04-08 03:01:01 UTC
ok, thanks for the informantion. So it looks like we are getting close
hopefully... I've got a kernel built with the patch above, although i
have to admit it hasn't seen much testing yet. The patch should obey
the large setting for wmem_default, unless it can't get the memory
segment requested, so i think that performance should be the same as
before. Test kernel is below, it also contians some additional fixes,
most notably the patches discussed in 117902. 

http://people.redhat.com/~jbaron/.private/testing/2.4.9-e.39.9.test/

Comment 21 John Caruso 2004-04-10 00:24:06 UTC
I've installed the e.39.9.test kernel on our server where the ENOBUFS 
errors occur frequently, and I'll reboot it as soon as possible (and 
reset wmem_default to the previous setting) to see if it fixes the 
problem.  FYI, we've not had any more ENOBUFS errors on this system 
with wmem_default set to 4096.


Comment 22 Jason Baron 2004-04-12 18:19:26 UTC
thanks! looking forward to hearing feedbak.

Comment 25 Jason Baron 2004-04-14 20:13:32 UTC
I don't want to sound pushy, but any updates on this one? this is a
very critical bug, and we're looking for as much feedback on this one
as possible. thanks.

Comment 26 John Caruso 2004-04-14 20:24:34 UTC
No ENOBUFS errors so far on the system that's running 2.4.9-
e.39.9.testsmp, since its reboot on Friday.  I hadn't posted any 
feedback about it yet because this system didn't always produce 
ENOBUFS errors within 4 days in the past, so there's still a chance 
that the bug will occur in the future.  Based on past behavior, I'd 
say that a week of error-free behavior should be enough for a fairly 
high degree of confidence that it's fixed.

(I do expect that the patch will successfully work around the bug, 
though, based on the wmem_default testing.)


Comment 27 Jason Baron 2004-04-14 20:27:35 UTC
thanks for the confidence, and the update.

Comment 28 Jason Baron 2004-04-19 15:39:15 UTC
how does the system look now that the system has presumably been
running the updated kernel for more than 1 week. Any -ENOBUFS?

Comment 29 John Caruso 2004-04-20 19:21:36 UTC
We've got good news and bad news.  The good news is that there are no 
more ENOBUFS errors.  Which is as expected; the code you've added 
would seem to make them almost impossible.  But the bad news is that 
the removal of the ENOBUFS errors appears to have resulted in another 
potentially more serious problem.

The cron job in question rsyncs files from a local disk to an NFS-
mounted directory.  With the e.39.9.testsmp, NFS is now hanging 
intermittently while attempting to read certain files--the rsync will 
sit there holding the file open (on the destination NFS volume) for 
hours, looping on a select() call, until it's finally killed.  This 
is causing errors in succeeding cron jobs, as they attempt to delete 
the opened files (actually the first deletion works and results in 
a .nfs file, but the later attempted deletion of the .nfs file is 
what produces the error).

We've been running this cron job for several years, and this has 
never happened before.  The fact that it only happened now with the 
e.39.9.testsmp kernel, and the fact that it started occurring in the 
uptime cycle right about when we would previously have started seeing 
ENOBUFS errors, would seem to indicate that it's a direct result of 
the workaround.  Apparently NFS doesn't always deal well with 
receiving buffers that are smaller than the ones it requested.

This is potentially a more serious bug for us than the original bug, 
since we rely on NFS for critical production functions.  It's bad to 
have rsync/ssh die with ENOBUFS errors, but to have NFS suddenly hang 
while reading critical data would be worse.

Let me know if you need any other info.  I've collected data from two 
separate incidents so far, but I'm not sure how useful it would be to 
you.  In particular, I've got AltSysrq-M output, but it was taken at 
the point when I received the cron job errors, not the point at which 
the rsync NFS read actually hung--so it may not be that useful.  The 
problem is occurring about 1-2 times per day.


Comment 30 Jason Baron 2004-04-21 02:29:17 UTC
hmmm, well nfs doesn't use unix domain sockets for transport, they are
only for local sockets, so i'm not sure exactly what's going on here,
but i certainly agree its related to the patch. Perhaps, its not quite
complete. I'll re-review it tomorrow. Also, what is the setting for
wmem_default during this time? 

Comment 31 John Caruso 2004-04-21 03:20:36 UTC
All kernel tunables related to networking are at defaults on this 
system, so wmem_default is set to 65535.

What you're saying implies that wmem_default only affects unix domain 
sockets, correct?  I'm surprised if that's the case, since the NFS 
recommendations I've seen frequently discuss changing wmem_default to 
improve NFS-over-UDP performance (e.g. 
http://www.netapp.com/tech_library/3183.html).  In any case, I don't 
know if the new problem is related to NFS or is happening within 
rsync, so that was just speculation on my part.  If it helps, the 
file on which rsync hangs is always on the destination, not the 
source (and the destination in this case is NFS-mounted).


Comment 32 Jason Baron 2004-04-21 11:28:08 UTC
wmem_default is supposed to be a system wide setting, so in theory it
could affect UDP buffer sizes and hence NFS performance. However, the
patch that i did is limited to unix sockets, so i would have expected
the patch to have a much more limited affect then chaning wmem_default.

Right, likely this change is probably affected how rsync is
communicating via unix sockets, and thus 'apparently' caused by NFS.

Comment 33 John Caruso 2004-04-21 17:49:08 UTC
Ok, got it--I didn't realize your patch affected unix domain sockets 
only.

In looking at the lsof/strace output for the rsync processes that are 
holding open the files on the destination, it does look like the 
select() call is hanging while trying to read a unix domain socket 
(and not the destination file).  Here's one example:

COMMAND   PID USER   FD   TYPE     DEVICE    SIZE     NODE NAME
rsync   27504 root  cwd    DIR       0,12    4096   
251498 /path/to/file (server:/mount)
rsync   27504 root  rtd    DIR        8,3    4096        2 /
rsync   27504 root  txt    REG        8,5  202212    
33250 /usr/bin/rsync
rsync   27504 root  mem    REG        8,3  495474   128034 /lib/ld-
2.2.4.so
rsync   27504 root  mem    REG        8,5   29500    
64285 /usr/lib/libpopt.so.0.0.0
rsync   27504 root  mem    REG        8,3  263059   
128104 /lib/libresolv-2.2.4.so
rsync   27504 root  mem    REG        8,3 5797952    
16316 /lib/i686/libc-2.2.4.so
rsync   27504 root  mem    REG        8,3  262396   
128096 /lib/libnss_files-2.2.4.so
rsync   27504 root    0u  unix 0xeeacc040         10595151 socket
rsync   27504 root    1r   REG       0,12 3232829  
2944263 /path/to/file/.nfs007abc0400000020 (server:/mount)
rsync   27504 root    2w  FIFO        0,0         10595150 pipe
rsync   27504 root    4u  unix 0xe5202580         10598232 socket
rsync   27504 root    6u  unix 0xe4f3e580         10598234 socket

# strace -fp 27504
Process 27504 attached - interrupt to quit
select(1, [0], NULL, NULL, {54, 30000}) = 0 (Timeout)
select(1, [0], NULL, NULL, {60, 0})     = 0 (Timeout)
select(1, [0], NULL, NULL, {60, 0} <unfinished ...>


Comment 34 Jason Baron 2004-04-21 21:10:33 UTC
John,

Any chance you have more of the strace of the rsync process? For
example, what rsync was doing before hanging in select. thanks.

Comment 35 John Caruso 2004-04-21 23:44:24 UTC
Unfortunately no--that strace was taken well after the process had 
hung.  It would be difficult to get an strace of the actual process 
as it's hanging, since it's a cron job that's rsync'ing a large 
volume of data (>15GB) and it runs every hour.  And strace'ing it 
generates so much data that it may actually affect the behavior of 
the bug itself.  If you need that to go forward, though, let me know 
and I'll see if it's practicable to strace each instance of the cron 
job (although then we'll have the problem of uploading the strace, 
which I believe was several hundred MB when I tried it once 
before...).


Comment 36 Jay Fenlason 2004-04-23 15:45:52 UTC
Unfortunatly, without more context in the rsync strace, there's no way
to tell whether it's a kernel problem (related to the patch) or a
userspace problem (a bug in rsync for me to look into).  Fortunatly,
you won't need to upload the entire strace--the last few hundred lines
of it should do.

Comment 37 Jason Baron 2004-04-23 19:50:04 UTC
Also, what version of rsync are you running?

Comment 38 John Caruso 2004-04-23 20:31:16 UTC
We're running rsync 2.5.7 (from the rsync-2.5.7-0.7 RPM, which is the 
most current one on RHN).  The ENOBUFS errors have occurred since 
well back into rsync 2.4.x.  Just as a reminder that the original 
underlying issue isn't only with rsync, please note comment 10 above; 
we also occasionally saw the ENOBUFS errors with scp.

I've added strace'ing to the rsync cron job, so now we'll just have 
to wait for the error to occur again.


Comment 39 John Caruso 2004-04-27 19:14:05 UTC
Ok, the rsync cron job running under strace finally started failing 
again.  I'm attaching a section of the end of the strace which I 
believe encompasses as much as is useful from that section; it starts 
with a sequence of lstat64 calls (of which there were thousands, in 
sequence--I've included just the last few), and continues through the 
point at which the rsync hangs.  I've changed names to protect the 
innocent.  If you need more, there's plenty--the full output is 
263MB / 4 million lines.


Comment 40 John Caruso 2004-04-27 19:15:47 UTC
Created attachment 99724 [details]
strace excerpt of hanging rsync cron job (under kernel e.39.9.testsmp)

strace excerpt of hanging rsync cron job (under kernel e.39.9.testsmp)

Comment 41 Jay Fenlason 2004-05-03 19:37:55 UTC
I've been looking at the strace output, and it looks like the 
sending rsync writes 20106 bytes of file data to the socket, then 
does a select() for writability on the socket.  The receiving rsync 
reads and processes all the data from the socket (data from a 
previous file and then the 20106 bytes of file data).  The receiver 
then does a select for readability on the socket.  The sender's 
select times out, so it tries again.  The retry times out, so it 
tries again. . . 
 
So the kernel is never telling the sender that it can write to the 
socket, even after the receiver has red all the data. 
 
I can try writing a quick test program that attempts to replicate 
rsync's behavior here, if you need final proof, but it's clear to me 
the problem isn't in rsync itself. 

Comment 42 Jason Baron 2004-05-03 19:57:37 UTC
Yes, i've looked into this a bit more a the way the kernel does
reference counting for the case where a buffer is broken up is
incorrect and would result in this type of hang. i'll post a patch,
and a test kernel shortly.

Comment 43 Jason Baron 2004-05-03 20:49:31 UTC
ok, here is the corrected patch. I should have been resetting err to
0, b/c the -ENOBUFS seems like it would stick to its orignal value.
This caused the wmem_alloc param to never drop, and thus we apparently
had no room in the buffer to write. Below should be the correct patch:

--- linux/net/unix/af_unix.c.bak	Wed Mar 31 13:35:42 2004
+++ linux/net/unix/af_unix.c	Wed Mar 31 14:35:18 2004
@@ -1344,6 +1344,12 @@ static int unix_stream_sendmsg(struct so
 		 */
 		 
 		skb=sock_alloc_send_skb(sk,size,msg->msg_flags&MSG_DONTWAIT, &err);
+	
+		while((err == -ENOBUFS) && (size > PAGE_SIZE)) {
+			size >>= 1;
+			err = 0;
+			skb = sock_alloc_send_skb(sk,size,msg->msg_flags&MSG_DONTWAIT, &err); 
+		}
 
 		if (skb==NULL)
 			goto out_err;

Comment 44 Jason Baron 2004-05-04 17:29:09 UTC
ok, here's a kernel with the corrected patch:

http://people.redhat.com/~jbaron/.private/testing/2.4.9-e.40.4.test/

Comment 45 John Caruso 2004-05-06 01:21:32 UTC
Installed (as of last night).  Now we'll just have to play the 
waiting game again to see if nothing happens.



Comment 46 Jason Baron 2004-05-12 13:52:47 UTC
any updates?

Comment 47 John Caruso 2004-05-12 19:19:06 UTC
The rsync cron job has been behaving, and we haven't seen any other 
odd behavior.  As before, I was waiting for a week of error-free 
behavior to say that it appears to be fixed...and we just reached 
that today.  So as far as I can tell it appears that this patch 
successfully works around the ENOBUFS issue without introducing any 
other issues.  Thanks for all the work.

What would be the schedule for rolling this out in a new kernel 
release?  And is the plan to combine this fix with the kswapd fix 
from bug 117902?


Comment 48 Larry Woodman 2004-05-12 19:26:07 UTC
John, we did combine this with the kswapd fix, can you test the latest
kernel in:

http://people.redhat.com/~jbaron/.private/testing/2.4.9-e.40.6.test/


Larry Woodman


Comment 49 Larry Woodman 2004-05-12 19:30:25 UTC
BTW, we improved the fix for bug 117902 to more aggressively reclaim
buffermem pages when the system was in a lowmem shortage situation.  
We are looking for as much testing on the e.40.6 kernel as we can 
get over the next few weeks.

Thanks, Larry Woodman


Comment 50 John Caruso 2004-05-12 19:34:44 UTC
Yes and no.  I can install that on our development server (where the 
rsync cron job has been failing) tonight, but not on our production 
database server (where we've seen both the kswapd issues and very 
frequent ENOBUFS errors)--mainly because of the name.  If you could 
regenerate it so that it has a standard kernel name--no "test"--I can 
try installing it on the production server this weekend as well.  The 
production server runs vendor software that's sensitive to kernel 
naming.

If that's a problem, I can just install on the development server to 
make sure it doesn't cause any obvious problems.  The production 
server has been running the kswapd-patched kernel (2.4.9-e.39.1smp) 
for some time now and hasn't had any more kswapd problems, so we're 
relatively sure that that fix is ok.


Comment 51 Jason Baron 2004-05-13 01:11:53 UTC
We like putting .test to mark it as such. Is there any other name you
might suggest that would play nice on the production server?

Comment 52 John Caruso 2004-05-13 20:15:00 UTC
And that's good practice.  But the e.39.1 test kernel didn't 
have "test" (or anything else) in the name--which is why I could 
install it on production.  I can't justify taking any risk whatsoever 
of this test affecting production, so if you can't provide another 
version with a standard kernel name I won't be able to install it on 
that system.  Normally it wouldn't be a problem, but as I say we have 
(critical) vendor software that has a non-zero chance of being 
affected by this.

I have installed the e.40.6.test kernel on our development server, 
though, so we'll see how that goes.


Comment 54 Jason Baron 2004-05-24 19:54:33 UTC
The latest kernel is e.40.8, which again has the fix for this issue
and the one for 117902, i also took, the ".test" out of the title so
that it could get wider testing. Any feedback is much appreciated. thanks.

http://people.redhat.com/~jbaron/.private/testing/2.4.9-e.40.8/

Comment 55 John Caruso 2004-05-27 19:32:36 UTC
I've installed e.40.8 on our development server, and we'll be 
enabling it on our production server this weekend.  It appears to be 
working fine--no more ENOBUFS errors, and no other apparent weirdness.


Comment 56 Jason Baron 2004-06-08 15:14:59 UTC
hi John,

just wanted to make sure you had the latest U5 beta kernel available,
if you are interested:

http://people.redhat.com/~jbaron/.private/u5/2.4.9-e.40.10/

thanks,

-Jason

Comment 57 John Caruso 2004-06-09 19:36:19 UTC
Thanks for mentioning.  Unless there's a compelling reason to install 
this new one, though, we're going to stay with the e.40.8smp kernel; 
we've already invested a lot of time into testing this issue, and 
it's painful to upgrade the kernel on our production servers because 
of downtime window requirements.

The e.40.8smp kernel appears to be addressing our existing issues, 
though--no more ENOBUFS errors. (Actually there are some questions 
about the kswapd fix now, but I'll add those to bug id 117902.)


Comment 58 John Flanagan 2004-08-18 14:25:08 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-437.html