1631998 – kinit dies silently from a "broken pipe" signal.

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	krb5
Sub Component:
Version:	27
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Robbie Harwood
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:	https://github.com/krb5/krb5/pull/859
Whiteboard:
Depends On:
Blocks:

Reported:	2018-09-23 05:20 UTC by Björn Persson
Modified:	2018-10-17 19:29 UTC
CC List:	8 users
Fixed In Version:	krb5-1.16.1-24.fc30
Clone Of:
Environment:
Last Closed:	2018-10-17 19:29:15 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
trace messages from kinit (2.55 KB, text/plain) 2018-09-23 05:20 UTC, Björn Persson

Description Björn Persson 2018-09-23 05:20:32 UTC

Created attachment 1486135
trace messages from kinit

Description of problem:
When KCM has the problem in bug 1537866, kinit fails to catch the error and report it.

Version-Release number of selected component (if applicable):
[root@tag ~]# rpm -q --file `which kinit`
krb5-workstation-1.15.2-9.fc27.x86_64

Actual results:
There is no error message from kinit; it just terminates silently with exit code 141, which suggests that it dies from a "broken pipe" signal.
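
For reference, an exit code of 141 is what a shell reports for a process killed by signal 13 (128 plus the signal number), and signal 13 is SIGPIPE on Linux. The following is a minimal C sketch of that arithmetic, not code from kinit. Both helper names are hypothetical and exist only for this illustration.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that writes to a pipe with no reader and
 * return the wait status collected by the parent, or -1 if setup fails. */
static int wait_status_of_sigpipe_child(void)
{
    int fds[2];
    int status;
    pid_t pid;

    if (pipe(fds) != 0)
        return -1;
    close(fds[0]);                      /* no reader is left on the pipe */

    pid = fork();
    if (pid < 0) {
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        signal(SIGPIPE, SIG_DFL);       /* default action: terminate */
        if (write(fds[1], "x", 1) < 0)  /* SIGPIPE is raised by this call */
            _exit(1);
        _exit(0);
    }

    close(fds[1]);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}

/* Hypothetical helper: the value a shell would typically show in $?. */
static int shell_exit_code(int status)
{
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);  /* killed by a signal */
    if (WIFEXITED(status))
        return WEXITSTATUS(status);     /* normal exit */
    return -1;
}
```

Passing the status from the first helper to the second yields 128 + SIGPIPE, which is the 141 seen here on Linux.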

Expected results:
kinit should detect the error and write an informative error message to the standard error stream to give the user an idea of where to start looking for the problem.

Comment 1 Robbie Harwood 2018-10-09 16:53:59 UTC

The last message in the trace is:

[2427] 1516758589.107701: Initializing KCM:1000 with default princ terrycloth

and then we subsequently receive a broken pipe.  That suggests that the error is in KCM (which it is) and then the program dies from a broken pipe signal (which is what went wrong).

We don't register signal handlers in kinit.  Actually logging a message here, while slightly prettier, doesn't provide any more information than what we already have.  To my mind that's not worth the complexity of handling signals here.

Comment 2 Björn Persson 2018-10-10 16:41:23 UTC

(In reply to Robbie Harwood from comment #1)
> That suggests that the error is in KCM (which it is)

If your stance is that it's wrong for KCM to close the connection, then you owe the user an error message. The SSSD guys seem to think that KCM is allowed to close the connection, in which case kinit must be prepared to reconnect as needed – and handle the situation in case the reconnection fails, which will also require an error message at some point.

> Actually logging a message
> here, while slightly prettier, doesn't provide any more information than
> what we already have.

I don't understand how you can say that. Do you expect users to always run programs under strace just to find out whether they succeed or fail?

If I had gotten an error message from kinit, then I wouldn't have needed to wonder why fedpkg said that authentication failed even after I had authenticated through kinit. I wouldn't have needed to waste time searching the web for documentation of exit codes from kinit. I wouldn't have needed to ask on the devel mailing list what the exit code 141 means, only to learn that I should have subtracted 128 and looked for a signal number instead. An accurate error message like "connection to KCM lost" would have told me right away that I should look for a problem related to something called "KCM". That would have saved a lot of time for me, for those who replied on the mailing list, for those who posted in bug 1537866 before me, and undoubtedly for many other users.

> To my mind that's not worth the complexity of handling signals here.

It's not necessary to have a signal handler to handle a lost connection. You only need to call "signal(SIGPIPE, SIG_IGN)" to ignore the signal, and then, when writev returns -1, compare errno to EPIPE.
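
A minimal sketch of the approach described above, assuming a POSIX system. It is not kinit or libkrb5 code; `write_request` and `demo_lost_connection` are hypothetical names, and a pipe with its read end closed stands in for the lost KCM connection.

```c
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: write a header and a body with one writev() call.
 * Returns 0 on success or the errno value on failure. Partial writes are
 * not handled in this sketch. */
static int write_request(int fd, const char *hdr, const char *body)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len = strlen(hdr);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = strlen(body);

    if (writev(fd, iov, 2) < 0)
        return errno;
    return 0;
}

/* Hypothetical demo: ignore SIGPIPE, then write to a pipe whose read end
 * has been closed. Returns the error code seen by the writer, or -1 if the
 * pipe could not be created. */
static int demo_lost_connection(void)
{
    int fds[2];
    int err;

    /* Process-wide setting: the write now fails with EPIPE instead of
     * terminating the process. */
    signal(SIGPIPE, SIG_IGN);

    if (pipe(fds) != 0)
        return -1;
    close(fds[0]);                      /* simulate the peer going away */

    err = write_request(fds[1], "hdr", "body");
    close(fds[1]);
    return err;
}
```

When the result equals EPIPE, the caller can print something like "connection to KCM lost" instead of exiting silently. Note that `signal(SIGPIPE, SIG_IGN)` changes the disposition for the whole process, which is reasonable in a program such as kinit but more intrusive inside a library.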

Comment 3 Robbie Harwood 2018-10-10 18:10:06 UTC

sssd-kcm is actually also responsible for the part of krb5 that talks to the KCM daemon.  Whatever decisions it makes with respect to sockets are up to it.

You didn't need to run the program under strace.  But yes, it's normal for internal errors to require enabling debugging; in the case of krb5, because it's a library, this is controlled with the KRB5_TRACE environment variable.

Comment 4 Jakub Hrozek 2018-10-11 10:07:38 UTC

Let me add some context.

There is this thing called "idle timeout" in all SSSD daemons. The idle timeout is a mechanism that protects the daemons from clients that would just connect, do nothing and starve the SSSD daemons out of fds. So if a client doesn't do anything past the idle timeout, sssd just closes that fd.

For most SSSD clients, this timeout can be quite aggressive, because it is not expected that e.g. a username resolution would take a long time, but also because the client is often part of SSSD as well (example: for the NSS operations, the sss_nss module is the client) and the client tries to reconnect to the daemon socket if an IO operation fails with EPIPE.

Now, there was a bug (caused by a downstream Fedora-specific patch :-/) that broke the idle timeout for KCM. This is of course something to fix on the KCM side, no argument about that. We also talked about increasing the idle timeout for KCM, because the interaction with KCM might often take a long time (what is my password again? Let me open my password manager...oh, a sigpipe..)

(In reply to Robbie Harwood from comment #3)
> sssd-kcm is actually also responsible for the part of krb5 that talks to the
> KCM daemon.  Whatever decisions it makes with respect to sockets are up to
> it.

Yes, but wouldn't it make libkrb5 more robust to retry the socket operation? 

I understand the reasoning that libkrb5 is a library and as such it has no business setting e.g. signal mask or handler for signals for the application. But the way this is handled in the SSSD client libraries is that the send() call also uses the MSG_NOSIGNAL flags. Looking at man 2 send, this seems like a Linux-specific thing, but I wonder if it is worth a try to improve the situation for Linux users of libkrb5 at least?

> 
> You didn't need to run the program under strace.  But yes, it's normal for
> internal errors to require enabling debugging; in the case of krb5, because
> it's a library, this is controlled with the KRB5_TRACE environment variable.

Comment 5 Robbie Harwood 2018-10-12 21:31:50 UTC

> But the way this is handled in the SSSD client libraries is that the send() call also uses the MSG_NOSIGNAL flags. Looking at man 2 send, this seems like a Linux-specific thing, but I wonder if it is worth a try to improve the situation for Linux users of libkrb5 at least?

This isn't a bad idea.  Proposed upstream.  (It's more complicated because we have to start using send() before we can pass flags to it; we've been using writev().  And also support BSD and Windows etc. etc.)
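
As a rough illustration of moving from writev() to a call that accepts flags, the sketch below uses sendmsg(), which takes the same iovec array, together with MSG_NOSIGNAL. This is an assumption about one possible shape, not the change proposed upstream. It assumes a platform that defines MSG_NOSIGNAL (Linux does); the function names are hypothetical, and a socketpair with one end closed stands in for the KCM socket.

```c
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send an iovec array without raising SIGPIPE.
 * Returns 0 on success or the errno value on failure. */
static int send_iov_nosignal(int fd, struct iovec *iov, int iovcnt)
{
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = iovcnt;

    /* MSG_NOSIGNAL affects only this call; the application's signal
     * dispositions are left untouched. */
    if (sendmsg(fd, &msg, MSG_NOSIGNAL) < 0)
        return errno;
    return 0;
}

/* Hypothetical demo: send to a Unix socket whose peer has closed, with
 * SIGPIPE at its default (fatal) disposition. Returns the error code seen
 * by the sender, or -1 if the socketpair could not be created. */
static int demo_send_to_closed_peer(void)
{
    int sv[2];
    int err;
    char hdr[] = "hdr";
    char body[] = "body";
    struct iovec iov[2];

    signal(SIGPIPE, SIG_DFL);           /* a plain writev() would be fatal */

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    close(sv[1]);                       /* simulate the daemon closing the fd */

    iov[0].iov_base = hdr;
    iov[0].iov_len = sizeof(hdr) - 1;
    iov[1].iov_base = body;
    iov[1].iov_len = sizeof(body) - 1;

    err = send_iov_nosignal(sv[0], iov, 2);
    close(sv[0]);
    return err;
}
```

The process survives the send and gets EPIPE back as an ordinary error. Platforms without MSG_NOSIGNAL would need a different mechanism, which is part of the portability concern mentioned above.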

> Yes, but wouldn't it make libkrb5 more robust to retry the socket operation?

I can see an argument for reconnecting.  Is there a point to doing so more than once?

Also, we'd only be able to retry on failure of writes, I think - the KCM remote won't do anything useful if we connect and then read.

Comment 6 Jakub Hrozek 2018-10-15 06:44:22 UTC

(In reply to Robbie Harwood from comment #5)
> > But the way this is handled in the SSSD client libraries is that the send() call also uses the MSG_NOSIGNAL flags. Looking at man 2 send, this seems like a Linux-specific thing, but I wonder if it is worth a try to improve the situation for Linux users of libkrb5 at least?
> 
> This isn't a bad idea.  Proposed upstream.  (It's more complicated because
> we have to start using send() before we can pass flags to it; we've been
> using writev().  And also support BSD and Windows etc. etc.)
> 

Thank you very much.

> > Yes, but wouldn't it make libkrb5 more robust to retry the socket operation?
> 
> I can see an argument for reconnecting.  Is there a point to doing so more
> than once?
> 

No, I think retrying once is what you want to do. If the socket can't be reopened, just fail.

> Also, we'd only be able to retry  on failure of writes, I think - the KCM
> remote won't do anything useful if we connect and then read.

Yes, I also agree here.
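
The retry policy agreed on above (reconnect at most once, only when a write fails, and give up if the reconnect fails) could be sketched roughly as follows. This is an illustration under assumptions, not the libkrb5 implementation: every name is hypothetical, the reconnect step is a caller-supplied callback, and the demo uses a fresh socketpair in place of a real connection to the KCM socket. It assumes a platform that defines MSG_NOSIGNAL.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback type: returns a newly connected fd, or -1. */
typedef int (*reconnect_fn)(void *ctx);

/* Hypothetical helper: send buf; if the peer has closed the connection,
 * reconnect once and resend. Returns 0 on success, -1 on failure. */
static int send_with_one_retry(int *fd, reconnect_fn reconnect, void *ctx,
                               const void *buf, size_t len)
{
    ssize_t n = send(*fd, buf, len, MSG_NOSIGNAL);

    if (n == (ssize_t)len)
        return 0;
    if (n >= 0 || (errno != EPIPE && errno != ECONNRESET))
        return -1;                      /* short write or unrelated error */

    /* The write failed because the peer went away: reconnect exactly once. */
    close(*fd);
    *fd = reconnect(ctx);
    if (*fd < 0)
        return -1;                      /* could not reopen: just fail */

    n = send(*fd, buf, len, MSG_NOSIGNAL);
    return n == (ssize_t)len ? 0 : -1;
}

/* Demo scaffolding: a reconnect callback backed by a new socketpair. */
struct demo_state {
    int peer_fd;                        /* other end of the new connection */
    int reconnects;                     /* how often reconnect was called */
};

static int demo_reconnect(void *ctx)
{
    struct demo_state *st = ctx;
    int sv[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    st->peer_fd = sv[1];
    st->reconnects++;
    return sv[0];
}

/* Returns 0 if the message arrives after exactly one reconnect. */
static int demo_retry_after_idle_close(void)
{
    struct demo_state st = { -1, 0 };
    char got[8] = { 0 };
    int sv[2], fd, ok;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    close(sv[1]);                       /* daemon side hit its idle timeout */
    fd = sv[0];

    ok = send_with_one_retry(&fd, demo_reconnect, &st, "ping", 4) == 0 &&
         st.reconnects == 1 &&
         read(st.peer_fd, got, 4) == 4 &&
         memcmp(got, "ping", 4) == 0;

    if (fd >= 0)
        close(fd);
    if (st.peer_fd >= 0)
        close(st.peer_fd);
    return ok ? 0 : -1;
}
```

Reads are deliberately not retried here, matching the point that the KCM side does nothing useful if the client connects and then only reads.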

Comment 7 Robbie Harwood 2018-10-17 19:29:15 UTC

SIGPIPE workaround sent to rawhide.  Will backport retry if it merges upstream.
