Bug 1609382
Summary: | sssd_ssh runs out of file descriptors and stops working
---|---
Product: | [Fedora] Fedora
Component: | sssd
Version: | 28
Hardware: | All
OS: | Linux
Status: | CLOSED CURRENTRELEASE
Severity: | urgent
Priority: | unspecified
Reporter: | Orion Poplawski <orion>
Assignee: | Jakub Hrozek <jhrozek>
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
CC: | abokovoy, fidencio, jhrozek, lslebodn, mzidek, orion, pbrezina, rharwood, sbose, ssorce
Fixed In Version: | sssd-1.16.3-1.fc28
Last Closed: | 2018-08-29 20:12:15 UTC
Type: | Bug
: | 1610667
Bug Blocks: | 1610667

Description
Orion Poplawski
2018-07-27 20:14:54 UTC
Well, sssd is just calling fork(), so I think it's just a matter of cert_to_ssh_key_step not closing the needed FDs in the parent.

(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cache_req_search_send] (0x0400): CR #0: Returning [USER.com] from cache
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cache_req_search_ncache_filter] (0x0400): CR #0: This request type does not support filtering result by negative cache
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cache_req_create_and_add_result] (0x0400): CR #0: Found 1 entries in domain ad.nwra.com
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cache_req_done] (0x0400): CR #0: Finished: Success
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x1000): Waiting for child [4203].
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x0100): child [4203] finished successfully.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cert_to_ssh_key_done] (0x1000): Certificate [MIIXXX] is valid.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x1000): Waiting for child [4210].
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x0100): child [4210] finished successfully.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cert_to_ssh_key_done] (0x1000): Certificate [MIIXXX] is valid.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x1000): Waiting for child [4212].
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [child_sig_handler] (0x0100): child [4212] finished successfully.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [cert_to_ssh_key_done] (0x1000): Certificate [MIIXXX] is valid.
(Fri Jul 27 13:29:53 2018) [sssd[ssh]] [client_recv] (0x0200): Client disconnected!

I suspect this check is not right:

    done:
        if (ret != EOK) {
            PIPE_CLOSE(pipefd_from_child);
            PIPE_CLOSE(pipefd_to_child);
        }

After the pipes are created, you are always going to want to close them regardless of other errors.

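To illustrate the point about closing the pipes on every path, here is a minimal, synchronous sketch in plain POSIX C. It is not the sssd code: `fd_close_once`, `pipe_close_both`, and `spawn_child_once` are hypothetical names, and the child simply exits instead of running a real helper. The real code is asynchronous and keeps the parent's ends open until the request finishes, so this only shows the general shape of an unconditional cleanup at `done:`.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Close one descriptor if it is open and mark it closed; safe to call twice. */
static void fd_close_once(int *fd)
{
    assert(fd != NULL);
    if (*fd != -1) {
        close(*fd);
        *fd = -1;
    }
}

/* Close both ends of a pipe pair; ends already at -1 are skipped. */
static void pipe_close_both(int p[2])
{
    fd_close_once(&p[0]);
    fd_close_once(&p[1]);
}

/*
 * Hypothetical helper: fork a short-lived child connected by two pipes and
 * wait for it. Returns 0 on success or an errno value on failure.
 */
static int spawn_child_once(void)
{
    int from_child[2] = { -1, -1 };
    int to_child[2] = { -1, -1 };
    int status = 0;
    int ret = 0;
    pid_t pid;

    if (pipe(from_child) == -1) { ret = errno; goto done; }
    if (pipe(to_child) == -1) { ret = errno; goto done; }

    pid = fork();
    if (pid == -1) { ret = errno; goto done; }

    if (pid == 0) {
        /* Child: a real helper would dup2() its pipe ends and exec here. */
        _exit(0);
    }

    /* Parent: the child's ends are never needed on this side. */
    fd_close_once(&from_child[1]);
    fd_close_once(&to_child[0]);

    if (waitpid(pid, &status, 0) == -1) { ret = errno; goto done; }
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) { ret = ECHILD; }

done:
    /* Unconditional cleanup: runs on success and on every error path. */
    pipe_close_both(from_child);
    pipe_close_both(to_child);
    return ret;
}
```

Calling `spawn_child_once()` repeatedly should leave the lowest free descriptor number unchanged. With the cleanup guarded by an error check instead, each successful call would leave the parent's two ends open.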
I see other examples of code like this:

src/providers/ad/ad_machine_pw_renewal.c:ad_machine_account_password_renewal_send()
src/providers/be_dyndns.c:be_nsupdate_send()
src/responder/pam/pamsrv_p11.c:pam_check_cert_send()
src/util/cert/cert_common_p11_child.c:cert_to_ssh_key_step()

but PIPE_CLOSE->PIPE_FD_CLOSE() has checks and so should be safe to call at any time.

Okay, I'll try to stop making a fool of myself pretending I understand the code and let someone else figure this out...

I'm sorry, but so far I can't reproduce this issue. Code-wise the pipes to the p11 child should be closed using a destructor:

    state->io = talloc(state, struct child_io_fds);
    if (state->io == NULL) {
        DEBUG(SSSDBG_OP_FAILURE, "talloc failed.\n");
        ret = ENOMEM;
        goto done;
    }
    state->io->write_to_child_fd = -1;
    state->io->read_from_child_fd = -1;
    talloc_set_destructor((void *) state->io, child_io_destructor);

Can you show more context from an strace run that captures some of the leaks? Can you also paste the output of "lsof -E -p $(pidof sssd_ssh)" that matches the strace? The -E argument should print some more useful info about the pipe.

Created attachment 1471881
strace -f -p $(pidof sssd_ssh) -s 512 -o /tmp/sssd_ssh.strace
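The destructor approach mentioned above ties the lifetime of the descriptors to the object that holds them. The sketch below shows that idea without talloc, using plain malloc/free. `struct child_io`, `child_io_new`, `child_io_adopt`, and `child_io_free` are hypothetical stand-ins for illustration only, not the sssd types or its `child_io_destructor`.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for a structure holding the parent's pipe ends. */
struct child_io {
    int read_from_child_fd;
    int write_to_child_fd;
};

/* Allocate the holder with both descriptors marked as "not open". */
static struct child_io *child_io_new(void)
{
    struct child_io *io = malloc(sizeof(*io));
    if (io == NULL) {
        return NULL;
    }
    io->read_from_child_fd = -1;
    io->write_to_child_fd = -1;
    return io;
}

/* Hand ownership of the parent's pipe ends to the holder. */
static void child_io_adopt(struct child_io *io, int read_fd, int write_fd)
{
    assert(io != NULL);
    io->read_from_child_fd = read_fd;
    io->write_to_child_fd = write_fd;
}

/* Destructor-style cleanup: close whatever is still open, then free. */
static void child_io_free(struct child_io *io)
{
    if (io == NULL) {
        return;
    }
    if (io->read_from_child_fd != -1) {
        close(io->read_from_child_fd);
        io->read_from_child_fd = -1;
    }
    if (io->write_to_child_fd != -1) {
        close(io->write_to_child_fd);
        io->write_to_child_fd = -1;
    }
    free(io);
}
```

The cleanup only helps if the descriptors are actually stored in the holder. Any pipe end that stays in a local variable and is never adopted is outside the destructor's reach and still has to be closed explicitly.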
I'm not seeing any more useful info from lsof -E, but here it is:
sssd_ssh 1911 root 22u unix 0x00000000470574d6 0t0 548765 /var/lib/sss/pipes/ssh type=STREAM
sssd_ssh 1911 root 24r FIFO 0,12 0t0 559448 pipe
sssd_ssh 1911 root 25r FIFO 0,12 0t0 558814 pipe
sssd_ssh 1911 root 27w FIFO 0,12 0t0 559449 pipe
sssd_ssh 1911 root 29w FIFO 0,12 0t0 558815 pipe
strace attached
I ran:
/usr/bin/sss_ssh_authorizedkeys orion
to reproduce the issue.
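A quick way to watch for this kind of leak between runs of the reproducer is to count the entries in /proc/<pid>/fd. The following is a small Linux-only sketch; `count_open_fds` is a hypothetical helper, not part of sssd or lsof.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count the entries under /proc/<pid>/fd (Linux-specific).
 * Returns the number of open descriptors, or -1 if the directory
 * cannot be read (no such process, or insufficient permissions).
 */
static int count_open_fds(pid_t pid)
{
    char path[64];
    DIR *dir;
    struct dirent *ent;
    int count = 0;

    assert(pid > 0);
    snprintf(path, sizeof(path), "/proc/%ld/fd", (long)pid);

    dir = opendir(path);
    if (dir == NULL) {
        return -1;
    }
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.') {
            continue; /* skip "." and ".." */
        }
        count++;
    }
    closedir(dir);
    return count;
}
```

When a process inspects itself, the count includes the descriptor opened by `opendir()`, so only differences between two calls are meaningful. Reading another user's process, such as sssd_ssh running as root, requires matching privileges. A count that keeps growing after each reproducer run would be consistent with the pipe entries shown in the lsof output.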
(In reply to Jakub Hrozek from comment #6)
> I'm sorry, but so far I can't reproduce this issue. Code-wise the pipes to
> the p11 child should be closed using a destructor:

Here is a patch https://pagure.io/SSSD/sssd/pull-request/3793
And IIRC Orion already tried that.

BTW it was introduced in 1.16.2

(In reply to Lukas Slebodnik from comment #8)
> Here is a patch https://pagure.io/SSSD/sssd/pull-request/3793
> And IIRC Orion already tried that.
>
> BTW it was introduced in 1.16.2

Thank you for the patch. Because I had no idea you were also working on the issue, I also wrote a patch, just different, which still uses the destructors. We can discuss which one we should use in the PR.

master: a76f96ac143128c11bdb975293d667aca861cd91

Upstream ticket: https://pagure.io/SSSD/sssd/issue/3794