Bug 75108
Summary: | openssh is getting ENOBUFS and dying | | |
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Need Real Name <aander07> |
Component: | kernel | Assignee: | Arjan van de Ven <arjanv> |
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Priority: | medium |
Version: | 7.2 | Hardware: | i386 |
OS: | Linux | Doc Type: | Bug Fix |
Last Closed: | 2003-07-29 13:46:54 UTC | | |
Description
Need Real Name
2002-10-04 16:10:25 UTC
To be clear, ssh is getting ENOBUFS and dying on machine B, and the memory output from above is from machine B, and represents a typical state when this issue is encountered.

Machine A configuration:

             total       used       free     shared    buffers     cached
Mem:        523856     521624       2232          0      10868     468352
-/+ buffers/cache:      42404     481452
Swap:      2104432          0    2104432

Module                  Size  Used by
cls_u32                 4528   0
sch_tbf                 2192   0
sch_cbq                11248   0
vfat                    9712   0  (autoclean) (unused)
fat                    31968   0  (autoclean) [vfat]
e100                   37968   2  (autoclean)
nfs                    73472   7  (autoclean)
lockd                  45168   1  (autoclean) [nfs]
sunrpc                 64816   1  (autoclean) [nfs lockd]
dummy0                   960   0  (autoclean) (unused)
raid0                   3136   1  (autoclean)

e100 driver reports:

Description             e100 - Intel(R) PRO/100+ Server Adapter
Driver_Name             Intel(R) PRO/100 Fast Ethernet Adapter - Loadable driver
Driver_Version          1.3.20
PCI_Vendor              0x8086
PCI_Device_ID           0x1229
PCI_Subsystem_Vendor    0x8086
PCI_Subsystem_ID        0x100c
PCI_Revision_ID         0x08

Machine B configuration:

Module                  Size  Used by
cls_u32                 4528   0
sch_tbf                 2192   0
sch_cbq                11248   0
vfat                    9712   0  (autoclean) (unused)
fat                    31968   0  (autoclean) [vfat]
e100                   37968   2  (autoclean)
nfs                    73472   7  (autoclean)
lockd                  45168   1  (autoclean) [nfs]
sunrpc                 64816   1  (autoclean) [nfs lockd]
dummy0                   960   0  (autoclean) (unused)
raid0                   3136   1  (autoclean)

(The tainted flag comes from the e100 driver.)

e100 driver reports:

Description             Intel(R) PRO/100+ Server Adapter (PILA8470B)
Driver_Name             e100
Driver_Version          1.6.22
PCI_Vendor              0x8086
PCI_Device_ID           0x1229
PCI_Subsystem_Vendor    0x8086
PCI_Subsystem_ID        0x100c
PCI_Revision_ID         0x0008

During the file copy process, the data is being read from system A, and copied to system B. Again, this behavior has been noted both with straight scp from system A to system B, and with rsync over ssh from system A to system B.

We have noted two other things of interest:

1) pure scp tends to trigger this faster, rsync over ssh tends to do marginally better.
2) if we strace the rsync over ssh on machine A or traffic shape machine A down to 2Mb/s output to machine B, it tends to complete more often. We still get failures, but the key is slowing down the rate at which data is sent to machine B.

The push is from machine A -> machine B; most of the data goes that direction.

One more data point, I have been unable to trigger this so far just using netcat of large files. This appears isolated to the use of openssh.

Even though 2.4.18 has problems with your workload, can you at least test that 2.4.18 makes this problem go away? Can you also make sure to enter into bugzilla the problems that the 2.4.18 kernel causes because that problem needs to be taken care of eventually should we use 2.4.18+ kernels for future errata.

After reviewing the code paths in question, the only way that a TCP socket sendmsg() call can return ENOBUFS is if sendmsg():

1) Is given a msg_controllen > INT_MAX
2) A sock_kmalloc of size msg_controllen fails

What is msg_controllen when openssh gets these ENOBUFS errors back from a TCP sendmsg() call?

Also, in the future it would be really nice if captured strace output was provided not just "gets ENOBUFS" as the latter does not tell us what system call the error is being returned from.

Created attachment 78906 [details]
strace of a failing sshd showing ENOBUFS from write()
Created attachment 78907 [details]
strace of a failing sshd showing ENOBUFS from write()
Created attachment 78908 [details]
another strace of a failing sshd showing ENOBUFS from write()
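For captures like the ones attached above, the question of which system call returns the error can be answered mechanically. The following is only a rough sketch in Python, assuming the default strace line layout (`call(args) = -1 ERRNO (text)`, optionally preceded by a PID); the names `FAILED_CALL` and `count_failures` are made up for this illustration and timestamped output (`strace -t`) is not handled.

```python
import re
from collections import Counter

# Matches a failed call in default strace output, for example:
#   write(4, "...", 35068) = -1 ENOBUFS (No buffer space available)
# An optional leading PID ("1234 " or "[pid 1234] ") is tolerated.
FAILED_CALL = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"   # optional PID prefix
    r"(?P<name>\w+)\(.*\)"             # syscall name and its arguments
    r"\s+=\s+-1\s+(?P<err>E[A-Z0-9]+)\b"  # failure marker and errno name
)

def count_failures(lines, errno_name="ENOBUFS"):
    """Count, per syscall name, the lines that failed with errno_name."""
    counts = Counter()
    for line in lines:
        match = FAILED_CALL.match(line.strip())
        if match and match.group("err") == errno_name:
            counts[match.group("name")] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python scan.py < sshd.strace
    for name, n in count_failures(sys.stdin).items():
        print(f"{name}: {n} ENOBUFS failure(s)")
```

Run against a capture, this prints one line per syscall that returned ENOBUFS, which is the detail a bare "gets ENOBUFS" report leaves out.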
What do I need to do to supply the value for msg_controllen? I also have a ~250MB tcpdump taken during one of these failures if you need that, but trying to upload that to bugzilla does not seem the smartest approach.

As a data point, 2.4.18-17.7.x has not exhibited this behavior yet.

Has there been any update on this issue? I am experiencing a similar problem.
I am using kernel 2.4.9-34enterprise, and when a burst of users log on I
get kernel errors, and certain commands within the application (HP
OpenMail) no longer work. I logged this initially with HP, and they had me run
strace on certain processes. Their report is as follows:
> >
> > I also did a strace on omsessd on the system here to compare the output.
> >
> > I also got occurrences of "kill(..., SIG_)) = -1 EPERM" and the
> > "accept(3,...[110]) = -1 EAGAIN", and don't think that these
> > show what the problem is.
> >
> > Looking at the trace for the failing case, there is firstly
> > the expected read
> >
> > read (4,
> > "\30\0\0\0\5\0\0\0\0\0\0\0\1\0\0\0\377\377\0\0\0\0\0\0\0",40)=24
> >
> > This matches the write from omstat.
> >
> > The write from omsessd then gives the following
> >
> > write(4, "........", 35068) = -1 ENOBUFS (No buffer space available)
> >
> > compared to
> >
> > write(4, ".......", 18912) = 18912
> >
> > when omstat works.
> >
> > So it looks like the full data is being written out to the
> > socket, but there is a problem with buffer space.
> > --
william ewing: you're running a WAAAAAAAAAAAY too old kernel.

Are you saying that this problem has been resolved in a later kernel?

Yes.
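For anyone stuck on an affected kernel in the meantime: since slowing the sender made the copies complete more often, an application that controls its own write loop could back off and retry when write() fails with ENOBUFS. This is a minimal sketch of that idea in Python, not the fix shipped in the errata; `write_all`, its parameters, and the default retry count and delay are assumptions chosen for illustration.

```python
import errno
import time

def write_all(write, data, retries=5, delay=0.05):
    """Write all of data through write(buf) -> bytes_written.

    On ENOBUFS, sleep with exponential backoff and retry, up to
    `retries` consecutive failures. Any other error, or too many
    consecutive ENOBUFS failures, is re-raised to the caller.
    """
    view = memoryview(data)
    sent = 0
    attempt = 0
    while sent < len(view):
        try:
            n = write(view[sent:])
        except OSError as exc:
            if exc.errno != errno.ENOBUFS or attempt >= retries:
                raise
            time.sleep(delay * (2 ** attempt))  # back off before retrying
            attempt += 1
            continue
        sent += n      # partial writes are fine; keep going
        attempt = 0    # progress was made, so reset the failure count
    return sent
```

With a real descriptor this would be called as `write_all(lambda buf: os.write(fd, buf), payload)` after `import os`. Whether retrying actually helps depends on why the kernel is returning ENOBUFS, so treat it as a stopgap at best.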