Red Hat Bugzilla – Bug 1003011
suboptimal nslcd protocol handling, spammy /var/log/messages
Last modified: 2013-11-10 02:51:56 EST
I noticed while trying to dig into memory problems that I'm currently getting a *lot* of these in /var/log/messages:
Aug 30 13:36:30 hostname nslcd: [bb40e4] <group=1000> error writing to client: Broken pipe
Aug 30 13:36:30 hostname nslcd: [c228b7] <group=1000> error writing to client: Broken pipe
Aug 30 13:37:18 hostname nslcd: [7d8042] <group=1000> error writing to client: Broken pipe
Aug 30 13:37:18 hostname nslcd: [4826c4] <group=1000> error writing to client: Broken pipe
Usually a flurry whenever I do anything on my desktop, in particular opening a new shell window/tab generates a few (so running a nested bash instance was a logical possibility, and doing that seems to reliably be the cause). After a bit of digging, frustrated by the fact that running the client program under strace makes the problem reliably disappear, but attaching strace to the nslcd progress brings it back, I tracked this to an "id -gn" subprocess calling getgrgid(1000).
The group in question contains many users, and driving the request directly shows that nslcd is returning a little over 4kB in its response:
( echo -ne "\001\000\000\000\0212\023\000\000\0350\003\000\000" ; sleep 1 ) | nc -U /var/run/nslcd/socket | wc -c
But getgrgid() internally calls with an initial 1kB buffer, doubling each time it gets a buffer overflow. Each time it flushes any outstanding readable data from the nslcd daemon by polling POLLIN with a zero timeout (in common/tio.c:tio_skipall()) then closes the socket, causing nslcd to get EPIPE on its next write if it hasn't completed the response yet. The unreliability of repro is down to a race between nslcd's individual 64-byte writes, and whether the client can issue a poll/read/poll cycle faster than nslcd can get the next write out. stracing the client slows it enough that nslcd can keep the pipe full to the end of the response, stracing nslcd slows it down to match so the client can see zero bytes available in the poll again and close early.
Googling shows this is a known problem:
In particular the poll/read/poll the client does was added specifically in an attempt to avoid this case by discarding outstanding data before closing. But the 0 poll timeout results in this race and the "fix" regularly not working.
Two solutions (I'd like both, but either would probably work):
1) The client is clearly *prepared* to, and intends to, read arbitrary amounts of data, discarding it and waiting for the final close of the pipe that nslcd will do once the response is complete - even if due to client-side buffering and the inconvenient getgrgid() et al. APIs we can't use any more of the data this time around. The fact that it gets confused in the middle sometimes is an accident. We trust nslcd to a large extent, and if we're using the data are prepared to wait up to 60 seconds for it. nslcd *is* delivering the data, and the syscall overhead is fractions of a millisecond on both ends, just an unfortunate race here.
So we could just wait with a non-zero timeout while in tio_skipall(). Using the main timeout of 60s wouldn't be a problem, but given we're discarding, even a short timeout, even as low as a few milliseconds, would be enough to complete the response without dropping it in the middle.
2) If the client can legitimately decide that it's no longer interested in the rest of the response, then EPIPE on write is pretty much a non-error and nslcd shouldn't bother logging it. Perhaps only if --debug is used. Any other error code, fine, but EPIPE can be safely ignored here.
The message isn't harmful as such, but there's an awful lot of them and they are kind of annoying when trying to find other more relevant stuff in syslog, and they don't indicate anything interesting or wrong happened.
Interesting, thank you for the bug report. I was under the impression that the problem was already fixed in that version.
Yes, according to the -users thread above, *a* fix was put into 0.8.7. It's just not a good enough fix. (Hmm. Perhaps it mostly works on a heavily loaded or single-core machine, where the write sequence isn't expected to get preempted by the read sequence. With two free running cores on this 8-core machine, the race loses all the time!)
Thanks for your detailed analysis and comments.
Regarding 1) I think it is an excellent proposal to avoid more of the broken pipe errors. I will add a short timeout of 500 msec because the request will probably be repeated a number of times and having a longer timeout could ensure in the application hanging longer if something goes wrong.
Regarding 2) I agree that the broken pipe error is, in general, not always an indication that something is terribly wrong. On the other hand it is a indication that a request was broken off.
Anyway, I haven't seen a "real" problem that caused the EPIPE error in a long time so I think it should be reasonably safe to ignore.
I'm going to include both changes in the next 0.9 upstream release and may also include the changes in a next 0.8 release.
Thank you very much for looking into this, Arthur. I'll be happy to test these patches as well.
The change for 1) is here:
The change for 2) is here:
I've only done some quick testing so feedback is very much appreciated.
That'll work most of the time, but the timeout can fail in clients using fast periodic SIGALRMs for their own purposes (the resulting EINTR resets to the full timeout each time.) There is existing logic in tio.c to handle this using tio_get_deadline/tio_time_remaining.
(As a complete sidenote, timeouts using gettimeofday() are inherently unreliable in the face of system clock changes, and though tio_wait attempts to handle this by clamping the residual to the total timeout, a better way is to use clock_gettime(CLOCK_MONOTONIC) instead. And again, in the face of fast periodic SIGALRMs the clamping will have no effect.)
(Oh, and it looks like the EINTR/EAGAIN check in tio_read() can drop through and leave -1 in readbuffer.len, which being declared a size_t within tio_buffer earlier in tio.c will "pass" the readbuffer.len>=count check next time round the loop and copy an inexhaustible supply of garbage into the client's buffer. Or segfault. Unlikely to hit if poll() has just returned immediate readability, but still possible perhaps.)
In the 0.9 series the tio_get_deadline() function was replaced, mainly to reduce the number of calls to gettimeofday() changing some bits of the code.
I've had a look at also using tio_wait() inside tio_skipall() and that isn't too difficult (actually it results in much cleaner code). Switching to clock_gettime(CLOCK_MONOTONIC) also seams easy but I want to test portability affects of that change before merging. Thanks for pointing it out.
Thanks for pointing out the EINTR/EAGAIN problem in tio_read(). Fixes for this have been committed and I will probably make new releases soonish with the fixes. I agree that is unlikely to to happen and I don't think this can be triggered maliciously.
Thanks again for your valuable feedback.
(In reply to Arthur de Jong from comment #7)
> In the 0.9 series the tio_get_deadline() function was replaced, mainly to
> reduce the number of calls to gettimeofday() changing some bits of the code.
On a similar vein to this, I noticed from the straces that nslcd always writes in 64 byte chunks, even though it must have pretty much all the data up front and therefore "know" how much there is, which means we're needlessly repeating poll/write in nslcd and gettimeofday/poll/read. (Well, instrumenting shows tio_write() called with lots of 3-11 byte chunks, so it doesn't know the total size up front, but is constructing the response without extra waits, I think.)
It looks like nslcd allows up to a 1MB buffer, but that tio will only grow the buffer from the initial 64 byte guess if an attempt to do the non-blocking write fails, but since the client is spinning on poll/read just as fast as nslcd can send the data (and in any case the underlying channel buffer is much larger than the total amount of data we're writing), this never happens.
(Another point here: the client end (common/nslcd-prot.c) uses an initial buffer of 1024 bytes with a max of 2MB, which seems oddly inconsistent.)
The initial nslcd writebuffer.size could be made larger, and/or we could always add the new data to the buffer if size<maxsize and not flush unless we hit maxsize or the caller flushes at the end of message. (I suspect many users will frequently hit a small number of kB, but hitting 1MB is much rarer. In any case, the local malloc/memcpy overhead is likely to be much less than the syscall overhead of lots of small writes.)
The 0.9.0 version of nss-pam-ldapd has already seen a few improvements in the communications handling to reduce the number of system calls.
The main reason that nslcd has a large write buffer is to handle slow NSS calls while reading as much data from the LDAP server as possible. For example, an application could call setpwent(), getpwent(), do stuff, getpwent(), do more stuff, etc. The application could be slow between getpwent() calls or the LDAP server could be slow in returning entries.
The buffer handling probably isn't ideal but it works reasonably well and seems to balance returning (partial) data quickly to the requesting application and doing buffering when needed.
We could go with a larger write buffer to start off with. One average passwd entry could be around 64 bytes in size. Waiting for four entries from LDAP before starting to write to the caller is probably OK.
Originally it was also an idea to hand off the flushing of the buffer to a separate thread that would write any remaining data in the buffers, so the worker thread would be available again to handle requests.
The reason that the client buffer is larger (at least the max) is to support retries of getgrent(). For groups with a lot of users, the function will be called (by libc) with a too small buffer and it will retry until the buffer is large enough. If we are going over the groups we can't easily restart the whole request so we do the equivalent of a seek() in the stream back to the beginning of the group entry. Groups that are bigger than 2MB are too big anyway.
Thanks again for your comments.
nss-pam-ldapd-0.8.13-3.fc19 has been submitted as an update for Fedora 19.
nss-pam-ldapd-0.8.13-4.fc20 has been submitted as an update for Fedora 20.
nss-pam-ldapd-0.8.13-3.fc18 has been submitted as an update for Fedora 18.
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing nss-pam-ldapd-0.8.13-4.fc20'
as soon as you are able to.
Please go to the following url:
then log in and leave karma (feedback).
nss-pam-ldapd-0.8.13-3.fc18 tested, and appears to resolve issue.
Thank you for testing and working with Arthur towards a good solution!
nss-pam-ldapd-0.8.13-3.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.
nss-pam-ldapd-0.8.13-3.fc18 has been pushed to the Fedora 18 stable repository. If problems still persist, please make note of it in this bug report.
nss-pam-ldapd-0.8.13-4.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.