Bug 1209600
| Summary: | Getting ERROR (getpwnam() failed): Broken pipe with 1.11.6 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Orion Poplawski <orion> |
| Component: | sssd | Assignee: | Lukas Slebodnik <lslebodn> |
| Status: | CLOSED ERRATA | QA Contact: | Namita Soman <nsoman> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6.6 | CC: | andrew.prowant, davids, grajaiya, hartsjc, jgalipea, jhrozek, ktadimar, lslebodn, mdshaikh, mkosek, mzidek, nsoman, orion, paulds, pbrezina, preichl, rgroten, ruben, sgoveas, ssorce, striker, tlavigne, unix_admins |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | sssd-1.13.3-18.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-10 20:21:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1075802, 1159926, 1172231 | | |
Description
Orion Poplawski
2015-04-07 18:02:02 UTC
This sounds an awful lot like https://bugzilla.redhat.com/show_bug.cgi?id=1142351. Since the suspected error was in the client piece, the best debug info would be an strace of crond while it is checking the credentials. Do you know exactly what crond does? Feel free to point us to the code. If not, we can involve the crond maintainer, but it seems like you did quite a bit of research already (thanks!)
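One way to capture such a trace (an illustrative invocation, not a command from the thread; Lukas suggests a near-identical one later):

```
# attach to the running crond, follow forked children, timestamp each syscall
strace -f -tt -o /tmp/crond.strace -p $(pgrep crond)
```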
I'm not authorized for 1142351, so I can't comment on that.

The cron code is from cronie-1.4.4/src/cron.c:506:

```c
/* check if user exists in time of job is being run f.e. ldap */
if ((pw = getpwnam(uname)) != NULL) {
    /* ... run a job if scheduled ... */
} else {
    log_it(uname, getpid(), "ERROR", "getpwnam() failed", errno);
}
```
So if the user exists, run the cron job if applicable; otherwise log the error (the logging is a new patch). So all errors are treated as fatal. I suppose one might argue that cron could also retry on EINTR, so a retry in the cron code may be worthwhile.
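A minimal sketch of such a retry, assuming the variables from the cronie excerpt above (this is an illustration, not a proposed cronie patch):

```c
/* Hypothetical bounded retry on EINTR around the lookup above;
 * not actual cronie code. */
struct passwd *pw = NULL;
int attempts = 3;

do {
    errno = 0;
    pw = getpwnam(uname);
} while (pw == NULL && errno == EINTR && --attempts > 0);

if (pw != NULL) {
    /* ... run a job if scheduled ... */
} else {
    log_it(uname, getpid(), "ERROR", "getpwnam() failed", errno);
}
```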
With the current configuration, I think this only happens when client_idle_timeout is a multiple of 60, which allows sssd_nss to close the connection at the same moment libnss_sss is writing to it.

I suspect, though, that in sssd-1.11.6/src/sss_client/common.c, sss_cli_send_req() should also handle EPIPE and re-establish the connection if it gets one, if that is possible:
```c
errno = 0;
if (datasent < SSS_NSS_HEADER_SIZE) {
    res = send(sss_cli_sd,
               (char *)header + datasent,
               SSS_NSS_HEADER_SIZE - datasent,
               SSS_DEFAULT_WRITE_FLAGS);
} else {
    rdsent = datasent - SSS_NSS_HEADER_SIZE;
    res = send(sss_cli_sd,
               (const char *)rd->data + rdsent,
               rd->len - rdsent,
               SSS_DEFAULT_WRITE_FLAGS);
}
error = errno;

if ((res == -1) || (res == 0)) {
    if ((error == EINTR) || (error == EAGAIN)) {
        /* If the write was interrupted, go back through
         * the loop and try again
         */
        continue;
    }

    /* Write failed */
    sss_cli_close_socket();
    *errnop = error;
    return SSS_STATUS_UNAVAIL;
}
```
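For illustration, a minimal sketch of that EPIPE handling at this spot, assuming the enclosing send loop and a caller willing to reconnect and resend; returning SSS_STATUS_TRYAGAIN here is my assumption, not the fix that eventually landed:

```c
if ((res == -1) || (res == 0)) {
    if ((error == EINTR) || (error == EAGAIN)) {
        /* Interrupted write: go back through the loop and try again. */
        continue;
    }

    /* Write failed: drop the socket either way. */
    sss_cli_close_socket();
    *errnop = error;

    if (error == EPIPE) {
        /* The daemon closed (or crashed and restarted) mid-request.
         * We cannot know how much it received, so signal the caller
         * to reconnect and resend the whole request instead of
         * failing outright. (Assumption: the caller bounds retries.) */
        return SSS_STATUS_TRYAGAIN;
    }

    return SSS_STATUS_UNAVAIL;
}
```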
I'll try to get a better strace; the one line I have now does seem to support this, though:

```
1579 11:25:01 sendto(6, "\30\0\0\0\21\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, 0) = -1 EPIPE (Broken pipe)
```

And the corresponding cron log entry:

```
Apr 7 11:25:01 netdrmsdb crond[1579]: (sumsmgr) ERROR (getpwnam() failed): Broken pipe
```
Yeah, here's some more:

```
09:06:01 select(6, [5], NULL, NULL, {0, 0}) = 0 (Timeout)
09:06:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 8
09:06:01 fstat(8, {st_mode=S_IFREG|0644, st_size=1589, ...}) = 0
...
09:06:01 close(8) = 0
09:06:01 munmap(0x7f11c56cc000, 4096) = 0
09:06:01 poll([{fd=6, events=POLLIN|POLLOUT}], 1, 300000) = 1 ([{fd=6, revents=POLLOUT}])
09:06:01 poll([{fd=6, events=POLLOUT}], 1, 300000) = 1 ([{fd=6, revents=POLLOUT}])
09:06:01 sendto(6, "\31\0\0\0\21\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, 0) = -1 EPIPE (Broken pipe)
09:06:01 close(6) = 0
09:06:01 gettimeofday({1428505561, 789678}, NULL) = 0
09:06:01 sendto(4, "<75>Apr 8 09:06:01 crond[25230]"..., 83, MSG_NOSIGNAL, NULL, 0) = 83
```

So select/poll indicates no issues, but the socket has already been closed on the server side by the time the client writes to it.
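This window between a successful poll() and a failing send() is easy to demonstrate in isolation. The following standalone program (an illustration written for this writeup, not code from the thread) polls a UNIX socket, has the peer close, and then gets EPIPE from send(), just as in the trace above:

```c
#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];

    /* A connected pair of UNIX stream sockets stands in for the
     * libnss_sss <-> sssd_nss connection. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return 1;

    /* poll() reports the socket writable, just like in the trace above. */
    struct pollfd pfd = { .fd = sv[0], .events = POLLOUT };
    poll(&pfd, 1, 0);

    /* The peer (the daemon, in the real scenario) closes in the window
     * between poll() and send(). */
    close(sv[1]);

    /* MSG_NOSIGNAL turns the would-be SIGPIPE into an EPIPE error. */
    if (send(sv[0], "x", 1, MSG_NOSIGNAL) == -1)
        printf("send: %s\n", strerror(errno));   /* prints: Broken pipe */

    close(sv[0]);
    return 0;
}
```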
Simo, do you have some immediate thoughts?

(In reply to Jakub Hrozek from comment #5)
> Simo, do you have some immediate thoughts?

We should handle EPIPE; if sssd_nss crashes and is restarted while a client is trying to send info, we'd probably get EPIPE. While thinking about that, I think we should also retry if we get EPIPE on recv(), but in that case we cannot just retry the recv(). In general, when we get EPIPE we do not know what the server received, so we should retry the whole operation. If we add EPIPE handling, though, the number of retries should be limited to avoid infinite loops (both on send() and recv() errors).
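Simo's bounded whole-request retry might look roughly like the sketch below; do_one_request() and MAX_EPIPE_RETRIES are invented names for illustration, not sssd identifiers, and the eventual upstream fix may be structured differently:

```c
#define MAX_EPIPE_RETRIES 2

/* Hypothetical wrapper: on EPIPE the client cannot know how much the
 * server received, so the whole request is re-sent from scratch, at
 * most MAX_EPIPE_RETRIES extra times to avoid an infinite loop. */
static enum sss_status request_with_retry(struct sss_cli_req_data *rd,
                                          int *errnop)
{
    enum sss_status status = SSS_STATUS_UNAVAIL;

    for (int attempt = 0; attempt <= MAX_EPIPE_RETRIES; attempt++) {
        /* do_one_request() would connect, send, and receive the reply. */
        status = do_one_request(rd, errnop);

        if (status == SSS_STATUS_SUCCESS || *errnop != EPIPE) {
            break;   /* done, or an error a retry cannot help */
        }

        /* EPIPE on send() or recv(): drop the dead socket and retry
         * the complete operation. */
        sss_cli_close_socket();
    }

    return status;
}
```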
Thanks, I'll clone upstream. Upstream ticket: https://fedorahosted.org/sssd/ticket/2626

Is it still reproducible on rhel6.7? Is there a problem with all cron jobs for users from sssd? Could you provide an example?

(In reply to Lukas Slebodnik from comment #11)
> Is it still reproducible on rhel6.7?

Yes. And I don't see anything noted in the upstream ticket about any work on it.

> Is there a problem with all cron jobs for users from sssd?

Yes.

> Could you provide an example?

I'm not sure what more is needed than what is already provided.

Orion, I'm sorry for asking you one more time, but I'm not able to reproduce this bug on a minimal rhel6.7 even though you gave me advice on how to reproduce it. Here is the crontab for an sssd user:

```
-bash-4.1$ crontab -l
* * * * * /bin/ls -l /tmp
* * * * * /usr/bin/touch /tmp/pok-`date +%s`
* * * * * /bin/touch /tmp/pok-`date +%s`
```

I can see many "termination of client" messages in the sssd logs, but there isn't any error in the cron log:

```
[root@host ~]# grep "Terminating idle client" /var/log/sssd/sssd_nss.log | wc -l
152
[root@host ~]# echo $?
1
[root@host ~]# wc -l /var/log/cron
8817 /var/log/cron
```

So there must be some missing info (different configuration, ...). Do you have the default configuration of cronie, cronie-anacron, crontabs, ...? I'm sorry, but I cannot fix the bug without a reproducer. Do you have any idea about a more reliable reproducer? I would be thankful if it happened once or twice a day. If you do not have an idea, could you provide a log file from sssd with millisecond debugging, plus the matching strace output? The strace output in Comment 4 is not enough to reproduce. Put the following lines into the [nss] section of sssd.conf:

```
debug_level = 9
debug_microseconds = true
```

and run strace:

```
strace -ttt -o strace_crond -p `pgrep crond`
```

I would like to apologize one more time, but deadlines are approaching and I would like to fix this bug. It would also be good to know the timestamps of the ERROR messages in the crond log.

I'm not sure how much more time I can put into this either, as we have a workaround. Perhaps there needs to be some other load on the machine to trigger it? The systems are a fairly standard config. ldap domain:

```
[domain/default]
auth_provider = ldap
cache_credentials = True
ldap_search_base = dc=nwra,dc=com
ldap_autofs_search_base = dc=cora,dc=nwra,dc=com
id_provider = ldap
ldap_id_use_start_tls = True
chpass_provider = ldap
ldap_chpass_uri = ldap://xxx/
ldap_uri = ldap://xxx/,ldap://xxx/
ldap_tls_cacertdir = /etc/openldap/cacerts
ldap_account_expire_policy = 389ds

[sssd]
services = nss, pam, autofs
config_file_version = 2
domains = default
```

I tried to reproduce the issue with a simple script and wasn't able to. Let me check with the customer on Monday and update you on the load and timings.

We can't reproduce the issue in-house, so I'm re-setting the flags for 6.9 for now. We can reconsider this bugzilla for 6.8 if we know how to reproduce the issue, but so far we haven't been able to, sorry.

We have been seeing the same behavior on CentOS 7 systems, with sssd-1.13.0-40.el7_2.1.x86_64. Same scenario: user accounts in LDAP, a per-minute user cron job, default sssd config. It is likely a difficult problem to reproduce on demand; with a 1/minute cron job, we're only seeing this happen a couple of times a day. I can confirm that the solution described in the solutions article above <https://access.redhat.com/solutions/1125133> does essentially resolve the issue. We added "client_idle_timeout = 67" to the "[nss]" section of /etc/sssd/sssd.conf and restarted the sssd service, and the errors have ceased. This doesn't technically resolve the issue of sssd's pipe handling, but it should resolve this timing conflict with cron. It seems like it might be sensible to change the default value of client_idle_timeout on this basis alone: it's a simple change, there should be no negative consequences, no code change is required, and it will mitigate the issue in the vast majority of cases.

I prepared packages for rhel6.7 x86_64 which should fix the issue: https://lslebodn.fedorapeople.org/sssd_cron_issue_repo/ At least it fixed the issue for me with a different reproducer, but it would be good if someone can confirm it with cron. Orion, could you try to remove the workaround (client_idle_timeout) and test with my packages? If someone else wants to test sssd packages on a different platform, I can prepare packages as well.

Sure, I can run it. Would you mind running createrepo on that directory?

(In reply to Orion Poplawski from comment #23)
> Sure, I can run it. Would you mind running createrepo on that directory?

Done

Thanks. Installed on the machine in the original report with the workaround removed. I'll keep an eye on it.

They are back:

```
Mar 3 20:44:01 netdrmsdb crond[1820]: (sumsmgr) ERROR (getpwnam() failed): Broken pipe
Mar 3 22:49:01 netdrmsdb crond[1820]: (sumsmgr) ERROR (getpwnam() failed): Broken pipe
Mar 4 04:55:01 netdrmsdb crond[1820]: (sumsmgr) ERROR (getpwnam() failed): Broken pipe
```

Have you restarted crond? The change was in the sssd-client package, which is loaded by glibc on startup of cron and is not reloaded after restarting the sssd daemon.

Ah, no I hadn't. Restarted now... Looks good after restart.

*** Bug 1142351 has been marked as a duplicate of this bug. ***

How to reproduce: It isn't trivial to reproduce this race condition, but I described it upstream and wrote a reproducer simpler than the crond configuration: https://lists.fedorahosted.org/archives/list/sssd-devel@lists.fedorahosted.org/thread/47GIYDLLOEQ5VVIPHJ3W3WIBEKILLQOP/ The simple C reproducer is not suitable for automation because the program will never finish with a fixed sssd.

master:
* 6748a4c9d75db997c724c1dcea541e0047742f52
* a452d199bc125e8d53033d7c00383b4a275ab85e

sssd-1-13:
* 710d6a9e66b08990ce10cf72e7cc6c8c023806ea
* c79e5f222e8de1ddc1cc260c1f3e5e204361580e

Hi, marking bug and sanity only verified as per the above comments. Thanks, Niru

I can confirm that the patched version by Lukas fixes the problem on RHEL6.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0782.html