Bug 1209600
| Summary: | Getting ERROR (getpwnam() failed): Broken pipe with 1.11.6 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Orion Poplawski <orion> |
| Component: | sssd | Assignee: | Lukas Slebodnik <lslebodn> |
| Status: | CLOSED ERRATA | QA Contact: | Namita Soman <nsoman> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6.6 | CC: | andrew.prowant, davids, grajaiya, hartsjc, jgalipea, jhrozek, ktadimar, lslebodn, mdshaikh, mkosek, mzidek, nsoman, orion, paulds, pbrezina, preichl, rgroten, ruben, sgoveas, ssorce, striker, tlavigne, unix_admins |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | sssd-1.13.3-18.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-10 20:21:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1075802, 1159926, 1172231 | | |
Description
Orion Poplawski
2015-04-07 18:02:02 UTC
This sounds an awful lot like https://bugzilla.redhat.com/show_bug.cgi?id=1142351. Since the suspected error was in the client piece, the best debug info would be an strace of crond while it is checking the credentials. Do you know exactly what crond does? Feel free to point us to the code. If not, we can involve the crond maintainer, but it seems like you did quite a bit of research already (thanks!)
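One way to capture such a trace (an illustrative invocation, not a command from the thread; Lukas suggests a near-identical one later):

```
# attach to the running crond, follow forked children, timestamp each syscall
strace -f -tt -o /tmp/crond.strace -p $(pgrep crond)
```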
I'm not authorized for 1142351, so I can't comment on that.

The cron code is from cronie-1.4.4/src/cron.c:506:

```c
/* check if user exists in time of job is being run f.e. ldap */
if ((pw = getpwnam(uname)) != NULL) {
    /* ... run a job if scheduled ... */
} else {
    log_it(uname, getpid(), "ERROR", "getpwnam() failed", errno);
}
```
So if the user exists, run the cron job if applicable; otherwise log the error (the logging is a new patch). So all errors are treated as fatal. I suppose one might argue that cron could also retry on EINTR, so a retry in the cron code may be worthwhile.
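A minimal sketch of such a retry, assuming the variables from the cronie excerpt above (this is an illustration, not a proposed cronie patch):

```c
/* Hypothetical bounded retry on EINTR around the lookup above;
 * not actual cronie code. */
struct passwd *pw = NULL;
int attempts = 3;

do {
    errno = 0;
    pw = getpwnam(uname);
} while (pw == NULL && errno == EINTR && --attempts > 0);

if (pw != NULL) {
    /* ... run a job if scheduled ... */
} else {
    log_it(uname, getpid(), "ERROR", "getpwnam() failed", errno);
}
```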
With the current configuration, I think this only happens when client_idle_timeout is a multiple of 60, which allows sssd_nss to close the connection at the same moment libnss_sss is writing to it.

I suspect, though, that in sssd-1.11.6/src/sss_client/common.c, sss_cli_send_req() should also handle EPIPE and re-establish the connection if it gets one, if that is possible:
```c
errno = 0;
if (datasent < SSS_NSS_HEADER_SIZE) {
    res = send(sss_cli_sd,
               (char *)header + datasent,
               SSS_NSS_HEADER_SIZE - datasent,
               SSS_DEFAULT_WRITE_FLAGS);
} else {
    rdsent = datasent - SSS_NSS_HEADER_SIZE;
    res = send(sss_cli_sd,
               (const char *)rd->data + rdsent,
               rd->len - rdsent,
               SSS_DEFAULT_WRITE_FLAGS);
}
error = errno;

if ((res == -1) || (res == 0)) {
    if ((error == EINTR) || (error == EAGAIN)) {
        /* If the write was interrupted, go back through
         * the loop and try again
         */
        continue;
    }

    /* Write failed */
    sss_cli_close_socket();
    *errnop = error;
    return SSS_STATUS_UNAVAIL;
}
```
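For illustration, a minimal sketch of that EPIPE handling at this spot, assuming the enclosing send loop and a caller willing to reconnect and resend; returning SSS_STATUS_TRYAGAIN here is my assumption, not the fix that eventually landed:

```c
if ((res == -1) || (res == 0)) {
    if ((error == EINTR) || (error == EAGAIN)) {
        /* Interrupted write: go back through the loop and try again. */
        continue;
    }

    /* Write failed: drop the socket either way. */
    sss_cli_close_socket();
    *errnop = error;

    if (error == EPIPE) {
        /* The daemon closed (or crashed and restarted) mid-request.
         * We cannot know how much it received, so signal the caller
         * to reconnect and resend the whole request instead of
         * failing outright. (Assumption: the caller bounds retries.) */
        return SSS_STATUS_TRYAGAIN;
    }

    return SSS_STATUS_UNAVAIL;
}
```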
I'll try to get a better strace; the one line I have now does seem to support this, though:

```
1579 11:25:01 sendto(6, "\30\0\0\0\21\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, 0) = -1 EPIPE (Broken pipe)
```

And the corresponding cron log entry:

```
Apr 7 11:25:01 netdrmsdb crond[1579]: (sumsmgr) ERROR (getpwnam() failed): Broken pipe
```
Yeah, here's some more:

```
09:06:01 select(6, [5], NULL, NULL, {0, 0}) = 0 (Timeout)
09:06:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 8
09:06:01 fstat(8, {st_mode=S_IFREG|0644, st_size=1589, ...}) = 0
...
09:06:01 close(8) = 0
09:06:01 munmap(0x7f11c56cc000, 4096) = 0
09:06:01 poll([{fd=6, events=POLLIN|POLLOUT}], 1, 300000) = 1 ([{fd=6, revents=POLLOUT}])
09:06:01 poll([{fd=6, events=POLLOUT}], 1, 300000) = 1 ([{fd=6, revents=POLLOUT}])
09:06:01 sendto(6, "\31\0\0\0\21\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, 0) = -1 EPIPE (Broken pipe)
09:06:01 close(6) = 0
09:06:01 gettimeofday({1428505561, 789678}, NULL) = 0
09:06:01 sendto(4, "<75>Apr 8 09:06:01 crond[25230]"..., 83, MSG_NOSIGNAL, NULL, 0) = 83
```

So select/poll indicates no issues, but the socket has already been closed on the server side by the time the client writes to it.
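This window between a successful poll() and a failing send() is easy to demonstrate in isolation. The following standalone program (an illustration written for this writeup, not code from the thread) polls a UNIX socket, has the peer close, and then gets EPIPE from send(), just as in the trace above:

```c
#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];

    /* A connected pair of UNIX stream sockets stands in for the
     * libnss_sss <-> sssd_nss connection. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return 1;

    /* poll() reports the socket writable, just like in the trace above. */
    struct pollfd pfd = { .fd = sv[0], .events = POLLOUT };
    poll(&pfd, 1, 0);

    /* The peer (the daemon, in the real scenario) closes in the window
     * between poll() and send(). */
    close(sv[1]);

    /* MSG_NOSIGNAL turns the would-be SIGPIPE into an EPIPE error. */
    if (send(sv[0], "x", 1, MSG_NOSIGNAL) == -1)
        printf("send: %s\n", strerror(errno));   /* prints: Broken pipe */

    close(sv[0]);
    return 0;
}
```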
Simo, do you have some immediate thoughts?

(In reply to Jakub Hrozek from comment #5)
> Simo, do you have some immediate thoughts?

We should handle EPIPE; if sssd_nss crashes and is restarted while a client is trying to send info, we'd probably get EPIPE. While thinking about that, I think we should also retry if we get EPIPE on recv(), but in that case we cannot just retry the recv(). In general, when we get EPIPE we do not know what the server received, so we should retry the whole operation. If we add EPIPE handling, though, the number of retries should be limited to avoid infinite loops (both on send() and recv() errors).
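Simo's bounded whole-request retry might look roughly like the sketch below; do_one_request() and MAX_EPIPE_RETRIES are invented names for illustration, not sssd identifiers, and the eventual upstream fix may be structured differently:

```c
#define MAX_EPIPE_RETRIES 2

/* Hypothetical wrapper: on EPIPE the client cannot know how much the
 * server received, so the whole request is re-sent from scratch, at
 * most MAX_EPIPE_RETRIES extra times to avoid an infinite loop. */
static enum sss_status request_with_retry(struct sss_cli_req_data *rd,
                                          int *errnop)
{
    enum sss_status status = SSS_STATUS_UNAVAIL;

    for (int attempt = 0; attempt <= MAX_EPIPE_RETRIES; attempt++) {
        /* do_one_request() would connect, send, and receive the reply. */
        status = do_one_request(rd, errnop);

        if (status == SSS_STATUS_SUCCESS || *errnop != EPIPE) {
            break;   /* done, or an error a retry cannot help */
        }

        /* EPIPE on send() or recv(): drop the dead socket and retry
         * the complete operation. */
        sss_cli_close_socket();
    }

    return status;
}
```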
Thanks, I'll clone upstream. Upstream ticket: https://fedorahosted.org/sssd/ticket/2626

Is it still reproducible on rhel6.7? Is there a problem with all cron jobs for users from sssd? Could you provide an example?

(In reply to Lukas Slebodnik from comment #11)
> Is it still reproducible on rhel6.7?

Yes. And I don't see anything noted in the upstream ticket about any work on it.

> Is there a problem with all cron jobs for users from sssd?

Yes.

> Could you provide an example?

I'm not sure what more is needed than what is already provided.

Orion, I'm sorry for asking you one more time, but I'm not able to reproduce this bug on a minimal rhel6.7 even though you gave me advice on how to reproduce it. Here is the crontab for an sssd user:

```
-bash-4.1$ crontab -l
* * * * * /bin/ls -l /tmp
* * * * * /usr/bin/touch /tmp/pok-`date +%s`
* * * * * /bin/touch /tmp/pok-`date +%s`
```

I can see many "termination of client" messages in the sssd logs, but there isn't any error in the cron log:

```
[root@host ~]# grep "Terminating idle client" /var/log/sssd/sssd_nss.log | wc -l
152
[root@host ~]# echo $?
1
[root@host ~]# wc -l /var/log/cron
8817 /var/log/cron
```

So there must be some missing info (different configuration, ...). Do you have the default configuration of cronie, cronie-anacron, crontabs, ...? I'm sorry, but I cannot fix the bug without a reproducer. Do you have any idea about a more reliable reproducer? I would be thankful if it happened once or twice a day. If you do not have an idea, could you provide a log file from sssd with millisecond debugging, plus the matching strace output? The strace output in Comment 4 is not enough to reproduce. Put the following lines into the [nss] section of sssd.conf:

```
debug_level = 9
debug_microseconds = true
```

and run strace:

```
strace -ttt -o strace_crond -p `pgrep crond`
```

I would like to apologize one more time, but deadlines are approaching and I would like to fix this bug. It would also be good to know the timestamps of the ERROR messages in the crond log.

I'm not sure how much more time I can put into this either, as we have a workaround. Perhaps there needs to be some other load on the machine to trigger it? The systems are a fairly standard config. ldap domain:

```
[domain/default]
auth_provider = ldap
cache_credentials = True
ldap_search_base = dc=nwra,dc=com
ldap_autofs_search_base = dc=cora,dc=nwra,dc=com
id_provider = ldap
ldap_id_use_start_tls = True
chpass_provider = ldap
ldap_chpass_uri = ldap://xxx/
ldap_uri = ldap://xxx/,ldap://xxx/
ldap_tls_cacertdir = /etc/openldap/cacerts
ldap_account_expire_policy = 389ds

[sssd]
services = nss, pam, autofs
config_file_version = 2
domains = default
```

I tried to reproduce the issue with a simple script and wasn't able to. Let me check with the customer on Monday and update you on the load and timings.

We can't reproduce the issue in-house, so I'm re-setting the flags for 6.9 for now. We can reconsider this bugzilla for 6.8 if we know how to reproduce the issue, but so far we haven't been able to, sorry.

We have been seeing the same behavior on CentOS 7 systems, with sssd-1.13.0-40.el7_2.1.x86_64. Same scenario: user accounts in LDAP, a per-minute user cron job, default sssd config. It is likely a difficult problem to reproduce on demand; with a 1/minute cron job, we're only seeing this happen a couple of times a day. I can confirm that the solution described in the solutions article above <https://access.redhat.com/solutions/1125133> does essentially resolve the issue. We added "client_idle_timeout = 67" to the "[nss]" section of /etc/sssd/sssd.conf and restarted the sssd service, and the errors have ceased. This doesn't technically resolve the issue of sssd's pipe handling, but it should resolve this timing conflict with cron. It seems like it might be sensible to change the default value of client_idle_timeout on this basis alone: it's a simple change, there should be no negative consequences, no code change is required, and it will mitigate the issue in the vast majority of cases.

I prepared packages for rhel6.7 x86_64 which should fix the issue: https://lslebodn.fedorapeople.org/sssd_cron_issue_repo/ At least it fixed the issue for me with a different reproducer, but it would be good if someone can confirm it with cron. Orion, could you try to remove the workaround (client_idle_timeout) and test with my packages? If someone else wants to test sssd packages on a different platform, I can prepare packages as well.

Sure, I can run it. Would you mind running createrepo on that directory?

(In reply to Orion Poplawski from comment #23)
> Sure, I can run it. Would you mind running createrepo on that directory?

Done

Thanks. Installed on the machine in the original report with the workaround removed. I'll keep an eye on it.

They are back:

```
Mar 3 20:44:01 netdrmsdb crond[1820]: (sumsmgr) ERROR (getpwnam() failed): Broken pipe
Mar 3 22:49:01 netdrmsdb crond[1820]: (sumsmgr) ERROR (getpwnam() failed): Broken pipe
Mar 4 04:55:01 netdrmsdb crond[1820]: (sumsmgr) ERROR (getpwnam() failed): Broken pipe
```

Have you restarted crond? The change was in the sssd-client package, which is loaded by glibc on startup of cron and is not reloaded after restarting the sssd daemon.

Ah, no I hadn't. Restarted now... Looks good after restart.

*** Bug 1142351 has been marked as a duplicate of this bug. ***

How to reproduce: It isn't trivial to reproduce this race condition, but I described it upstream and wrote a reproducer simpler than the crond configuration: https://lists.fedorahosted.org/archives/list/sssd-devel@lists.fedorahosted.org/thread/47GIYDLLOEQ5VVIPHJ3W3WIBEKILLQOP/ The simple C reproducer is not suitable for automation because the program will never finish with a fixed sssd.

master:
* 6748a4c9d75db997c724c1dcea541e0047742f52
* a452d199bc125e8d53033d7c00383b4a275ab85e

sssd-1-13:
* 710d6a9e66b08990ce10cf72e7cc6c8c023806ea
* c79e5f222e8de1ddc1cc260c1f3e5e204361580e

Hi, marking bug and sanity only verified as per the above comments. Thanks, Niru

I can confirm that the patched version by Lukas fixes the problem on RHEL6.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0782.html