Description of problem:

sssd_ssh accumulates lots of open pipe file descriptors, e.g:

sssd_ssh 25415 root   68r  FIFO  0,12  0t0  707481 pipe
sssd_ssh 25415 root   69w  FIFO  0,12  0t0  707398 pipe
sssd_ssh 25415 root   71w  FIFO  0,12  0t0  707473 pipe
sssd_ssh 25415 root   73w  FIFO  0,12  0t0  707482 pipe

eventually (quickly) it runs out and stops working with:

(Fri Jul 27 11:09:16 2018) [sssd[ssh]] [cert_to_ssh_key_step] (0x0020): pipe failed [24][Too many open files].
(Fri Jul 27 11:09:16 2018) [sssd[ssh]] [ssh_get_output_keys_done] (0x0040): cert_to_ssh_key request failed.

They seem to be triggered by execution of /usr/bin/sss_ssh_authorizedkeys and perhaps other things. This is with an IPA<->AD trust and we are returning ssh keys from AD user certificates.

Version-Release number of selected component (if applicable):
sssd-1.16.2-4.fc28.x86_64

strace shows:

25415 pipe([118, 120]) = 0
25415 pipe([122, 123]) = 0
25415 clone( <unfinished ...>
1771  set_robust_list(0x7f34310acbe0, 24 <unfinished ...>
25415 <... clone resumed> child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f34310acbd0) = 1771
1771  <... set_robust_list resumed> ) = 0
25415 close(120 <unfinished ...>
1771  close(123 <unfinished ...>
25415 <... close resumed> ) = 0
1771  <... close resumed> ) = 0
25415 fcntl(118, F_GETFL <unfinished ...>
1771  dup2(122, 0 <unfinished ...>
25415 <... fcntl resumed> ) = 0 (flags O_RDONLY)
1771  <... dup2 resumed> ) = 0
25415 fcntl(118, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
1771  close(118 <unfinished ...>
25415 <... fcntl resumed> ) = 0
1771  <... close resumed> ) = 0
25415 close(122 <unfinished ...>
1771  dup2(120, 1 <unfinished ...>
25415 <... close resumed> ) = 0
1771  <... dup2 resumed> ) = 1
25415 fcntl(123, F_GETFL <unfinished ...>
1771  execve("/usr/libexec/sssd/p11_child", ["/usr/libexec/sssd/p11_child", "--debug-microseconds=0", "--debug-timestamps=1", "--debug-fd=2", "--debug-level=0xf7f0", "--verification", "--nssdb", "/etc/pki/nssdb", "--certificate", "XXX"
25415 <... fcntl resumed> ) = 0x1 (flags O_WRONLY)
25415 fcntl(123, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
1771  <... execve resumed> ) = 0
25415 write(7, "(Fri Jul 27 12:53:17 2018) [sssd[ssh]] [child_handler_setup] (0x2000): Setting up signal handler up for pid [1771]\n", 115 <unfinished ...>

But lsof shows:

sssd_ssh 25415 root  118r  FIFO  0,12  0t0  720911 pipe
sssd_ssh 25415 root  119w  FIFO  0,12  0t0  719651 pipe
sssd_ssh 25415 root  120r  FIFO  0,12  0t0  721018 pipe
sssd_ssh 25415 root  121w  FIFO  0,12  0t0  719660 pipe
sssd_ssh 25415 root  122r  FIFO  0,12  0t0  721100 pipe
sssd_ssh 25415 root  123w  FIFO  0,12  0t0  720912 pipe

Looks like CLONE_FILES is not being set but the FDs are being closed in the child process? I don't see the issue in sssd-1.16.0-19.el7_5.5, and strace shows a completely different trace there.
Well, sssd is just calling fork(), so I think it's just a matter of cert_to_ssh_key_step not closing the needed FDs in the parent.
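For illustration, this is the general shape of the pattern in question (a minimal standalone sketch, not the actual SSSD code; all names here are hypothetical): after fork() both processes hold all four pipe descriptors, so if the parent never closes the ends it does not use, every request leaks descriptors in the long-running responder.

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int to_child[2];    /* parent writes, child reads  */
    int from_child[2];  /* child writes, parent reads  */

    if (pipe(to_child) != 0 || pipe(from_child) != 0) {
        perror("pipe");
        return 1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        return 1;
    }

    if (pid == 0) {
        /* Child: wire its ends to stdin/stdout and drop everything else. */
        dup2(to_child[0], STDIN_FILENO);
        dup2(from_child[1], STDOUT_FILENO);
        close(to_child[0]);
        close(to_child[1]);
        close(from_child[0]);
        close(from_child[1]);
        execlp("cat", "cat", (char *)NULL);
        _exit(1);
    }

    /* Parent: these close() calls are exactly what goes missing in the
     * leak scenario -- without them the child's ends stay open in the
     * parent and pile up over many requests. */
    close(to_child[0]);
    close(from_child[1]);

    /* ... talk to the child over to_child[1] / from_child[0], then close
     * those as well once the request is finished ... */
    close(to_child[1]);
    close(from_child[0]);
    return 0;
}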
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cache_req_search_send] (0x0400): CR #0: Returning [USER.com] from cache
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cache_req_search_ncache_filter] (0x0400): CR #0: This request type does not support filtering result by negative cache
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cache_req_create_and_add_result] (0x0400): CR #0: Found 1 entries in domain ad.nwra.com
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cache_req_done] (0x0400): CR #0: Finished: Success
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x1000): Waiting for child [4203].
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x0100): child [4203] finished successfully.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cert_to_ssh_key_done] (0x1000): Certificate [MIIXXX] is valid.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x1000): Waiting for child [4210].
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x0100): child [4210] finished successfully.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cert_to_ssh_key_done] (0x1000): Certificate [MIIXXX] is valid.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x1000): Waiting for child [4212].
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x0100): child [4212] finished successfully.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cert_to_ssh_key_done] (0x1000): Certificate [MIIXXX] is valid.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [client_recv] (0x0200): Client disconnected!
I suspect this check is not right:

done:
    if (ret != EOK) {
        PIPE_CLOSE(pipefd_from_child);
        PIPE_CLOSE(pipefd_to_child);
    }

After the pipes are created, you are always going to want to close them regardless of other errors.
I see other examples of code like this:

src/providers/ad/ad_machine_pw_renewal.c:ad_machine_account_password_renewal_send()
src/providers/be_dyndns.c:be_nsupdate_send()
src/responder/pam/pamsrv_p11.c:pam_check_cert_send()
src/util/cert/cert_common_p11_child.c:cert_to_ssh_key_step()

but PIPE_CLOSE() (which expands to PIPE_FD_CLOSE()) has checks and so should be safe to call at any time.
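For reference, a hedged sketch of what such guarded close macros are expected to do (the exact SSSD definitions may differ from this):

/* Closing is skipped for fds that are already -1 and the fd is reset
 * afterwards, so calling these twice, or on a pipe that was never
 * created, is harmless. */
#define PIPE_FD_CLOSE(fd) do {  \
    if ((fd) != -1) {           \
        close(fd);              \
        (fd) = -1;              \
    }                           \
} while (0)

#define PIPE_CLOSE(p) do {      \
    PIPE_FD_CLOSE((p)[0]);      \
    PIPE_FD_CLOSE((p)[1]);      \
} while (0)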
Okay, I'll try to stop making a fool of myself pretending I understand the code and let someone else figure this out...
I'm sorry, but so far I can't reproduce this issue. Code-wise the pipes to the p11 child should be closed using a destructor:

 77     state->io = talloc(state, struct child_io_fds);
 78     if (state->io == NULL) {
 79         DEBUG(SSSDBG_OP_FAILURE, "talloc failed.\n");
 80         ret = ENOMEM;
 81         goto done;
 82     }
 83     state->io->write_to_child_fd = -1;
 84     state->io->read_from_child_fd = -1;
 85     talloc_set_destructor((void *) state->io, child_io_destructor);

Can you show more context from an strace run that captures some of the leaks? Can you also paste the output of "lsof -E -p $(pidof sssd_ssh)" that matches the strace? The -E argument should print some more useful info about the pipe.
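For readers following the thread, the destructor referenced above is the usual talloc pattern: when the state->io allocation is freed, any fds still stored in it are closed. A hedged, simplified sketch (not a verbatim copy of the SSSD source; error logging omitted and the struct redefined here just to keep the example self-contained):

#include <unistd.h>
#include <talloc.h>

struct child_io_fds {
    int read_from_child_fd;
    int write_to_child_fd;
};

/* Called by talloc when the child_io_fds allocation is freed: close and
 * reset any fd that is still open. */
static int child_io_destructor(void *ptr)
{
    struct child_io_fds *io = talloc_get_type(ptr, struct child_io_fds);
    if (io == NULL) {
        return -1;
    }

    if (io->write_to_child_fd != -1) {
        close(io->write_to_child_fd);
        io->write_to_child_fd = -1;
    }

    if (io->read_from_child_fd != -1) {
        close(io->read_from_child_fd);
        io->read_from_child_fd = -1;
    }

    return 0;
}

The open question in the thread is then which descriptors ever make it into this structure: anything duplicated for the child but never stored there still has to be closed explicitly in the parent.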
Created attachment 1471881 [details]
strace -f -p $(pidof sssd_ssh) -s 512 -o /tmp/sssd_ssh.strace

I'm not seeing any more useful info from lsof -E, but here it is:

sssd_ssh 1911 root   22u  unix 0x00000000470574d6  0t0 548765 /var/lib/sss/pipes/ssh type=STREAM
sssd_ssh 1911 root   24r  FIFO  0,12  0t0  559448 pipe
sssd_ssh 1911 root   25r  FIFO  0,12  0t0  558814 pipe
sssd_ssh 1911 root   27w  FIFO  0,12  0t0  559449 pipe
sssd_ssh 1911 root   29w  FIFO  0,12  0t0  558815 pipe

strace attached.

I ran:

/usr/bin/sss_ssh_authorizedkeys orion

to reproduce the issue.
(In reply to Jakub Hrozek from comment #6)
> I'm sorry, but so far I can't reproduce this issue. Code-wise the pipes to
> the p11 child should be closed using a destructor:
>
>  77     state->io = talloc(state, struct child_io_fds);
>  78     if (state->io == NULL) {
>  79         DEBUG(SSSDBG_OP_FAILURE, "talloc failed.\n");
>  80         ret = ENOMEM;
>  81         goto done;
>  82     }
>  83     state->io->write_to_child_fd = -1;
>  84     state->io->read_from_child_fd = -1;
>  85     talloc_set_destructor((void *) state->io, child_io_destructor);
>
> Can you show more context from an strace run that captures some of the
> leaks? Can you also paste the output of "lsof -E -p $(pidof sssd_ssh)" that
> matches the strace? The -E argument should print some more useful info about
> the pipe.

Here is a patch: https://pagure.io/SSSD/sssd/pull-request/3793
And IIRC Orion already tried that.

BTW it was introduced in 1.16.2
(In reply to Lukas Slebodnik from comment #8)
> (In reply to Jakub Hrozek from comment #6)
> > I'm sorry, but so far I can't reproduce this issue. Code-wise the pipes to
> > the p11 child should be closed using a destructor:
> >
> >  77     state->io = talloc(state, struct child_io_fds);
> >  78     if (state->io == NULL) {
> >  79         DEBUG(SSSDBG_OP_FAILURE, "talloc failed.\n");
> >  80         ret = ENOMEM;
> >  81         goto done;
> >  82     }
> >  83     state->io->write_to_child_fd = -1;
> >  84     state->io->read_from_child_fd = -1;
> >  85     talloc_set_destructor((void *) state->io, child_io_destructor);
> >
> > Can you show more context from an strace run that captures some of the
> > leaks? Can you also paste the output of "lsof -E -p $(pidof sssd_ssh)" that
> > matches the strace? The -E argument should print some more useful info about
> > the pipe.
>
> Here is a patch https://pagure.io/SSSD/sssd/pull-request/3793
> And IIRC Orion already tried that.
>
> BTW it was introduced in 1.16.2

Thank you for the patch. Because I had no idea you were also working on the issue, I wrote a patch as well, a different one that still uses the destructors. We can discuss in the PR which one we should use.
master: a76f96ac143128c11bdb975293d667aca861cd91
Upstream ticket: https://pagure.io/SSSD/sssd/issue/3794