Bug 1609382 - sssd_ssh runs out of file descriptors and stops working
Summary: sssd_ssh runs out of file descriptors and stops working
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: sssd
Version: 28
Hardware: All
OS: Linux
unspecified
urgent
Target Milestone: ---
Assignee: Jakub Hrozek
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1610667
TreeView+ depends on / blocked
 
Reported: 2018-07-27 20:14 UTC by Orion Poplawski
Modified: 2020-05-02 18:59 UTC (History)
10 users (show)

Fixed In Version: sssd-1.16.3-1.fc28
Clone Of:
: 1610667 (view as bug list)
Environment:
Last Closed: 2018-08-29 20:12:15 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
strace -f -p $(pidof sssd_ssh) -s 512 -o /tmp/sssd_ssh.strace (338.23 KB, application/x-gzip)
2018-07-31 16:41 UTC, Orion Poplawski
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github SSSD sssd issues 4790 0 None closed sssd_ssh leaks file descriptors when more than one certificate is converted into an SSH key 2020-09-30 04:04:46 UTC

Description Orion Poplawski 2018-07-27 20:14:54 UTC
Description of problem:

sssd_ssh accumulates lots of open pipe file descriptors, e.g:

sssd_ssh 25415 root   68r     FIFO               0,12      0t0    707481 pipe
sssd_ssh 25415 root   69w     FIFO               0,12      0t0    707398 pipe
sssd_ssh 25415 root   71w     FIFO               0,12      0t0    707473 pipe
sssd_ssh 25415 root   73w     FIFO               0,12      0t0    707482 pipe

eventually (quickly) it runs out and stops working with:

(Fri Jul 27 11:09:16 2018) [sssd[ssh]] [cert_to_ssh_key_step] (0x0020): pipe failed [24][Too many open files].
(Fri Jul 27 11:09:16 2018) [sssd[ssh]] [ssh_get_output_keys_done] (0x0040): cert_to_ssh_key request failed.

They seem to be triggered by execution of /usr/bin/sss_ssh_authorizedkeys and perhaps other things.

This is with an IPA<->AD trust and we are returning ssh keys from AD user certificates.

Version-Release number of selected component (if applicable):
sssd-1.16.2-4.fc28.x86_64


strace shows:

25415 pipe([118, 120])                  = 0
25415 pipe([122, 123])                  = 0
25415 clone( <unfinished ...>
1771  set_robust_list(0x7f34310acbe0, 24 <unfinished ...>
25415 <... clone resumed> child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f34310acbd0) = 1771
1771  <... set_robust_list resumed> )   = 0
25415 close(120 <unfinished ...>
1771  close(123 <unfinished ...>
25415 <... close resumed> )             = 0
1771  <... close resumed> )             = 0
25415 fcntl(118, F_GETFL <unfinished ...>
1771  dup2(122, 0 <unfinished ...>
25415 <... fcntl resumed> )             = 0 (flags O_RDONLY)
1771  <... dup2 resumed> )              = 0
25415 fcntl(118, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
1771  close(118 <unfinished ...>
25415 <... fcntl resumed> )             = 0
1771  <... close resumed> )             = 0
25415 close(122 <unfinished ...>
1771  dup2(120, 1 <unfinished ...>
25415 <... close resumed> )             = 0
1771  <... dup2 resumed> )              = 1
25415 fcntl(123, F_GETFL <unfinished ...>
1771  execve("/usr/libexec/sssd/p11_child", ["/usr/libexec/sssd/p11_child", "--debug-microseconds=0", "--debug-timestamps=1", "--debug-fd=2", "--debug-level=0xf7f0", "--verification", "--nssdb", "/etc/pki/nssdb", "--certificate", "XXX"
25415 <... fcntl resumed> )             = 0x1 (flags O_WRONLY)
25415 fcntl(123, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
1771  <... execve resumed> )            = 0
25415 write(7, "(Fri Jul 27 12:53:17 2018) [sssd[ssh]] [child_handler_setup] (0x2000): Setting up signal handler up for pid [1771]\n", 115 <unfinished ...>


But lsof shows:

sssd_ssh 25415 root  118r     FIFO               0,12      0t0    720911 pipe
sssd_ssh 25415 root  119w     FIFO               0,12      0t0    719651 pipe
sssd_ssh 25415 root  120r     FIFO               0,12      0t0    721018 pipe
sssd_ssh 25415 root  121w     FIFO               0,12      0t0    719660 pipe
sssd_ssh 25415 root  122r     FIFO               0,12      0t0    721100 pipe
sssd_ssh 25415 root  123w     FIFO               0,12      0t0    720912 pipe

Looks like CLONE_FILES is not being set but the FDs are being closed in the child process?

I don't see the issue in sssd-1.16.0-19.el7_5.5, and strace shows a completely different trace there.

Comment 1 Orion Poplawski 2018-07-27 20:25:42 UTC
Well, sssd is just calling fork(), so I think it's just a matter of cert_to_ssh_key_step not closing the needed FDs in the parent.

Comment 2 Orion Poplawski 2018-07-27 20:31:46 UTC
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cache_req_search_send] (0x0400): CR #0: Returning [USER.com] from cache
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cache_req_search_ncache_filter] (0x0400): CR #0: This request type does not support filtering result by negative cache
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cache_req_create_and_add_result] (0x0400): CR #0: Found 1 entries in domain ad.nwra.com
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cache_req_done] (0x0400): CR #0: Finished: Success
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x1000): Waiting for child [4203].
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x0100): child [4203] finished successfully.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cert_to_ssh_key_done] (0x1000): Certificate [MIIXXX] is valid.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x1000): Waiting for child [4210].
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x0100): child [4210] finished successfully.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cert_to_ssh_key_done] (0x1000): Certificate [MIIXXX] is valid.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x1000): Waiting for child [4212].
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x0100): child [4212] finished successfully.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cert_to_ssh_key_done] (0x1000): Certificate [MIIXXX] is valid.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [client_recv] (0x0200): Client disconnected!

Comment 3 Orion Poplawski 2018-07-27 20:34:24 UTC
I suspect this check is not right:

done:
    if (ret != EOK) {
        PIPE_CLOSE(pipefd_from_child);
        PIPE_CLOSE(pipefd_to_child);
    }


After the pipes are created, you are always going to want to close them regardless of other errors.

Comment 4 Orion Poplawski 2018-07-27 20:39:54 UTC
I see other examples of code like this:

src/providers/ad/ad_machine_pw_renewal.c:ad_machine_account_password_renewal_send()
src/providers/be_dyndns.c:be_nsupdate_send()
src/responder/pam/pamsrv_p11.c:pam_check_cert_send()
src/util/cert/cert_common_p11_child.c:cert_to_ssh_key_step()

but PIPE_CLOSE->PIPE_FD_CLOSE() has checks and so should be safe to call at any time.

Comment 5 Orion Poplawski 2018-07-27 21:33:58 UTC
Okay, I'll try to stop making a fool of myself pretending I understand the code and let someone else figure this out...

Comment 6 Jakub Hrozek 2018-07-31 09:17:55 UTC
I'm sorry, but so far I can't reproduce this issue. Code-wise the pipes to the p11 child should be closed using a destructor:

 77     state->io = talloc(state, struct child_io_fds);
 78     if (state->io == NULL) {
 79         DEBUG(SSSDBG_OP_FAILURE, "talloc failed.\n");
 80         ret = ENOMEM;
 81         goto done;
 82     }
 83     state->io->write_to_child_fd = -1;
 84     state->io->read_from_child_fd = -1;
 85     talloc_set_destructor((void *) state->io, child_io_destructor);

Can you show more context from an strace run that captures some of the leaks? Can you also paste the output of "lsof -E -p $(pidof sssd_ssh)" that matches the strace? The -E argument should print some more useful info about the pipe.

Comment 7 Orion Poplawski 2018-07-31 16:41:58 UTC
Created attachment 1471881 [details]
strace -f -p $(pidof sssd_ssh) -s 512 -o /tmp/sssd_ssh.strace

I'm not seeing any more useful info from lsof -E, but here it is:

sssd_ssh 1911 root   22u     unix 0x00000000470574d6       0t0  548765 /var/lib/sss/pipes/ssh type=STREAM
sssd_ssh 1911 root   24r     FIFO               0,12       0t0  559448 pipe
sssd_ssh 1911 root   25r     FIFO               0,12       0t0  558814 pipe
sssd_ssh 1911 root   27w     FIFO               0,12       0t0  559449 pipe
sssd_ssh 1911 root   29w     FIFO               0,12       0t0  558815 pipe

strace attached

I ran:

/usr/bin/sss_ssh_authorizedkeys orion

to reproduce the issue.

Comment 8 Lukas Slebodnik 2018-07-31 21:06:33 UTC
(In reply to Jakub Hrozek from comment #6)
> I'm sorry, but so far I can't reproduce this issue. Code-wise the pipes to
> the p11 child should be closed using a destructor:
> 
>  77     state->io = talloc(state, struct child_io_fds);
>  78     if (state->io == NULL) {
>  79         DEBUG(SSSDBG_OP_FAILURE, "talloc failed.\n");
>  80         ret = ENOMEM;
>  81         goto done;
>  82     }
>  83     state->io->write_to_child_fd = -1;
>  84     state->io->read_from_child_fd = -1;
>  85     talloc_set_destructor((void *) state->io, child_io_destructor);
> 
> Can you show more context from an strace run that captures some of the
> leaks? Can you also paste the output of "lsof -E -p $(pidof sssd_ssh)" that
> matches the strace? The -E argument should print some more useful info about
> the pipe.

Here is a patch https://pagure.io/SSSD/sssd/pull-request/3793
And IIRC Orion already tried that.

BTW it was introduced in 1.16.2

Comment 9 Jakub Hrozek 2018-08-01 07:12:12 UTC
(In reply to Lukas Slebodnik from comment #8)
> (In reply to Jakub Hrozek from comment #6)
> > I'm sorry, but so far I can't reproduce this issue. Code-wise the pipes to
> > the p11 child should be closed using a destructor:
> > 
> >  77     state->io = talloc(state, struct child_io_fds);
> >  78     if (state->io == NULL) {
> >  79         DEBUG(SSSDBG_OP_FAILURE, "talloc failed.\n");
> >  80         ret = ENOMEM;
> >  81         goto done;
> >  82     }
> >  83     state->io->write_to_child_fd = -1;
> >  84     state->io->read_from_child_fd = -1;
> >  85     talloc_set_destructor((void *) state->io, child_io_destructor);
> > 
> > Can you show more context from an strace run that captures some of the
> > leaks? Can you also paste the output of "lsof -E -p $(pidof sssd_ssh)" that
> > matches the strace? The -E argument should print some more useful info about
> > the pipe.
> 
> Here is a patch https://pagure.io/SSSD/sssd/pull-request/3793
> And IIRC Orion already tried that.
> 
> BTW it was introduced in 1.16.2

Thank you for the patch. Because I had no idea you are also working at the issue, I also wrote a patch, just different, which still uses the destructors.

We can discuss which one we should use in the PR.

Comment 10 Jakub Hrozek 2018-08-02 10:55:39 UTC
master: a76f96ac143128c11bdb975293d667aca861cd91

Comment 11 Jakub Hrozek 2018-08-29 13:13:17 UTC
Upstream ticket:
https://pagure.io/SSSD/sssd/issue/3794


Note You need to log in before you can comment on or make changes to this bug.