Bug 2149091
| Summary: | Update to sssd-2.7.3-4.el8_7.1.x86_64 resulted in "Request to sssd failed. Device or resource busy" | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Danish Shaikh <dshaikh> | ||||
| Component: | sssd | Assignee: | Alexey Tikhonov <atikhono> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Anuj Borah <aborah> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 8.7 | CC: | aborah, aboscatt, abroy, aogburn, asharov, atikhono, christophe.viallon, ekeck, jwooten, k.w.mulderij, nsuryawa, pbrezina, R.Eggermont, sbose, shobbs, tscherf | ||||
| Target Milestone: | rc | Keywords: | Triaged, ZStream | ||||
| Target Release: | --- | Flags: | pm-rhel:
mirror+
|
||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | sync-to-jira | ||||||
| Fixed In Version: | sssd-2.8.2-2.el8 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | |||||||
| : | 2172240 (view as bug list) | Environment: | |||||
| Last Closed: | 2023-05-16 09:08:11 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 2172240 | ||||||
| Deadline: | 2023-02-13 | ||||||
| Attachments: |
|
||||||
|
Description
Danish Shaikh
2022-11-28 19:30:53 UTC
Confirmed for me, exact same issue. In our case, this is happening under "high"(well very relative) authentication request rate using pam_sss to sssd authentication method (in fact Basic authentication of apache , redirected to pam , then sssd ...) . As already said, i confirm that it is solve by downgrading to sssd 2.6.2-4. As of the 25 of january 2023, this bogue daesn't seem to be solve by any sssd update. Very annoying. No problems seen with sudo, or ssh access. I have noted some other errors sometimes like "bad address" in place of "Device buzy" in /var/log/secure or "journalctl -u httpd". I rised the debug_level to 10 in [pam] section of sssd and could get the following logs, that seems related to the already mentioned "pam_sss(pfmenu:auth): Request to sssd failed. Device or resource busy" in /var/log/secure or journalctl logs ... My versions (Not working) : sssd version 2.7.3.4 sssd-pam idem (inclu dans sssd) mod_authnz_pam 1.1.0.7 pam-1.3.1-22.el8.x86_64 My old (still working) versions : sssd-2.6.2-4.el8_6.1.x86_64 pam-1.3.1-16.el8.x86_64 Logs (debug_level =10) for sssd_pam responder: (2023-01-20 20:34:18): [pam] [ldb] (0x10000): [CID#61] Added timed event "ldb_kv_callback": 0x5638973115f0 (2023-01-20 20:34:18): [pam] [ldb] (0x10000): [CID#61] Added timed event "ldb_kv_timeout": 0x5638972f7710 (2023-01-20 20:34:18): [pam] [ldb] (0x10000): [CID#61] Running timer event 0x5638973115f0 "ldb_kv_callback" (2023-01-20 20:34:18): [pam] [ldb] (0x10000): [CID#61] Destroying timer event 0x5638972f7710 "ldb_kv_timeout" (2023-01-20 20:34:18): [pam] [ldb] (0x10000): [CID#61] Destroying timer event 0x5638973115f0 "ldb_kv_callback" (2023-01-20 20:34:18): [pam] [filter_responses_env] (0x4000): [CID#61] Found PAM ENV filter for variable [KRB5CCNAME] and service [sudo]. (2023-01-20 20:34:18): [pam] [filter_responses_env] (0x4000): [CID#61] Found PAM ENV filter for variable [KRB5CCNAME] and service [sudo-i]. (2023-01-20 20:34:18): [pam] [pam_reply] (0x0200): [CID#61] blen: 67 (2023-01-20 20:34:18): [pam] [pam_reply] (0x0200): [CID#61] Returning [0]: Succès to the client ---- (2023-01-20 20:34:18): [pam] [client_recv] (0x0400): [CID#61] Failed to read request, aborting client! ---- (2023-01-20 20:34:18): [pam] [client_close_fn] (0x2000): [CID#61] Terminated client [0x563897338210][24] (2023-01-20 20:34:18): [pam] [get_client_cred] (0x4000): Client [0x563897347050][24] creds: euid[48] egid[48] pid[10914] cmd_line['/usr/sbin/httpd']. (2023-01-20 20:34:18): [pam] [setup_client_idle_timer] (0x4000): Idle timer re-set for client [0x563897347050][24] (2023-01-20 20:34:18): [pam] [accept_fd_handler] (0x0400): [CID#62] Client [cmd /usr/sbin/httpd][uid 48][0x563897347050][24] connected! (2023-01-20 20:34:18): [pam] [sss_cmd_get_version] (0x0200): [CID#62] Received client version [3]. (2023-01-20 20:34:18): [pam] [sss_cmd_get_version] (0x0200): [CID#62] Offered version [3]. (2023-01-20 20:34:18): [pam] [pam_cmd_preauth] (0x0100): [CID#62] entering pam_cmd_preauth ... I hope that it helps Hi , just for information we tried to downgrade sssd to last 2.6.4 version without success (more than 170 pacquets needed , and gives a lot of incompatibility problemes). Any date for the fix ? We are stucked in our project developpment because of this bug that let pam/sssd use impossible! Please rise it's priority if possible (In reply to christophe.viallon from comment #21) > > Any date for the fix ? At the moment I have no other ideas besides building "intermediate" versions of SSSD between 2.6.2-4.el8_6.1 and sssd-2.7.3-4.el8_7.1 in hope to catch a patch that introduced an issue. > Please rise it's priority if possible Would you be willing to test a number of test builds? Might be alleviated by https://github.com/SSSD/sssd/pull/6560 (In reply to Alexey Tikhonov from comment #22) > (In reply to christophe.viallon from comment #21) > > > > Any date for the fix ? > > At the moment I have no other ideas besides building "intermediate" versions > of SSSD between 2.6.2-4.el8_6.1 and sssd-2.7.3-4.el8_7.1 in hope to catch a > patch that introduced an issue. > > > > Please rise it's priority if possible > > Would you be willing to test a number of test builds? Sorry for this delay, probably quite late. I would like to help on this but i will be unable to find necessary time. I saw that you had some pregess it seems anyway. Inspecting strace of `httpd`:
ID=1665912 starts multiple threads, including 1665914, 1665915, 1665916
* 1665914:
...
22:13:05 stat("/var/lib/sss/pipes/pam", {st_mode=S_IFSOCK|0666, st_size=0, ...}) = 0
22:13:05 getpid() = 1665871
22:13:05 fstat(-1, 0x7f3797608420) = -1 EBADF (Bad file descriptor)
22:13:05 getpid() = 1665871
22:13:05 socket(AF_UNIX, SOCK_STREAM, 0) = 18
22:13:05 fcntl(18, F_GETFL) = 0x2 (flags O_RDWR)
22:13:05 fcntl(18, F_SETFL, O_RDWR|O_NONBLOCK) = 0
22:13:05 fcntl(18, F_GETFD) = 0
22:13:05 fcntl(18, F_SETFD, FD_CLOEXEC) = 0
22:13:05 connect(18, {sa_family=AF_UNIX, sun_path="/var/lib/sss/pipes/pam"}, 110) = 0
22:13:05 fstat(18, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
22:13:05 poll([{fd=18, events=POLLOUT}], 1, 300000) = 1 ([{fd=18, revents=POLLOUT}])
22:13:05 sendto(18, "...", 16, MSG_NOSIGNAL, NULL, 0) = 16
...
22:13:05 close(18) = 0 <-- probably pam_end()
* 1665915: the same as in 1665914 and also at 22:13:05
* 1665916:
...
22:13:05 stat("/var/lib/sss/pipes/pam", {st_mode=S_IFSOCK|0666, st_size=0, ...}) = 0
22:13:05 getpid() = 1665871
22:13:05 poll([{fd=18, events=POLLIN|POLLOUT}], 1, 300000) = 1 ([{fd=18, revents=POLLOUT}]) <-- sss_cli_check_socket()
22:13:05 getsockopt(18, SOL_SOCKET, SO_PEERCRED, {pid=3600209, uid=0, gid=0}, [12]) = 0 <-- check_server_cred(), so this is sss_pam_make_request()
22:13:05 poll([{fd=18, events=POLLOUT}], 1, 300000) = 1 ([{fd=18, revents=POLLOUT}])
22:13:05 sendto(18, "...", 16, MSG_NOSIGNAL, NULL, 0) = 16
22:13:05 poll([{fd=18, events=POLLOUT}], 1, 300000) = 1 ([{fd=18, revents=POLLOUT}])
22:13:05 sendto(18, "..."..., 121, MSG_NOSIGNAL, NULL, 0) = 121
22:13:05 poll([{fd=18, events=POLLIN}], 1, 300000) = 1 ([{fd=18, revents=POLLIN}])
22:13:05 read(18, "...", 16) = 16
22:13:05 poll([{fd=18, events=POLLIN}], 1, 300000) = 1 ([{fd=18, revents=POLLIN}])
22:13:05 read(18, "..."..., 80) = 80
...
22:13:05 stat("/var/lib/sss/pipes/pam", {st_mode=S_IFSOCK|0666, st_size=0, ...}) = 0
22:13:05 getpid() = 1665871
22:13:05 poll([{fd=18, events=POLLIN|POLLOUT}], 1, 300000) = 1 ([{fd=18, revents=POLLOUT}]) <-- sss_cli_check_socket()
22:13:05 getsockopt(18, SOL_SOCKET, SO_PEERCRED, {pid=3600209, uid=0, gid=0}, [12]) = 0 <-- check_server_cred()
22:13:05 poll([{fd=18, events=POLLOUT}], 1, 300000) = 1 ([{fd=18, revents=POLLOUT}])
22:13:05 sendto(18, "...", 16, MSG_NOSIGNAL, NULL, 0) = 16
22:13:05 poll([{fd=18, events=POLLOUT}], 1, 300000) = 1 ([{fd=18, revents=POLLOUT}])
22:13:05 sendto(18, "..."..., 88, MSG_NOSIGNAL, NULL, 0) = 88
22:13:05 poll([{fd=18, events=POLLIN}], 1, 300000) = 1 ([{fd=18, revents=POLLNVAL}]) <-- POLLNVAL: Invalid request: fd not open
22:13:05 futex(0x7f37946030c0, FUTEX_WAKE_PRIVATE, 1) = 0
22:13:05 sendto(19, "<83>Feb 7 22:13:05 httpd: pam_s"..., 99, MSG_NOSIGNAL, NULL, 0) = 99
22:13:05 socket(AF_NETLINK, SOCK_RAW, NETLINK_AUDIT) = 18
So id=1665916 clearly re-uses socket fd open by either 1665914 or 1665915, and then it got closed mid pam transaction.
Most probably bug was introduced in https://github.com/SSSD/sssd/commit/c7918bef57df57fa8c9226fadf7e29e272d1fbff , where locking sss_pam_close_fd() was replaced with lockless sss_cli_close_socket().
Besides, I think sss_client common code should *not* close socket in case of POLLNVAL (otherwise there is a risk to close "foreign" socket if fd was re-used by a different thread).
(In reply to Alexey Tikhonov from comment #32) > > Most probably bug was introduced in https://github.com/SSSD/sssd/commit/c7918bef57df57fa8c9226fadf7e29e272d1fbff , > where locking sss_pam_close_fd() was replaced with lockless sss_cli_close_socket(). JFTR: this ^^ commit didn't introduce a bug per se (but rather paved a road to). At that point "lockfree client" feature was expected to be built unconditionally, every thread would always operate on its own socket and thus a lock before close() wasn't needed. But later feature support was conditioned by specific functionality in glibc, and thus feature build was disabled on some platforms, including RHEL8. Restoration of the lock for this case was missed. Pushed PR: https://github.com/SSSD/sssd/pull/6560 * `master` * bf3f73ea0ee123fe4e7c4bdd2287ac5a5e6d9082 - PAM_SSS: close(sss_cli_sd) should also be protected with mutex. Otherwise a thread calling pam_end() can close socket mid pam transaction in another thread. * ef93284b5a1f196425d9a61e8e24de8972240eb3 - SSS_CLIENT: if poll() returns POLLNVAL then socket is alredy closed (or wasn't open) so it shouldn't be closed again. Otherwise there is a risk to close "foreign" socket opened in another thread. * 0b8638d8de435384562f17d041655887b73523cd - SSS_CLIENT: fix error codes returned by common read/write/check helpers. * `sssd-2-8` * 1fd7a5ecb46a02a29ebf42039575b5344307bfbb - PAM_SSS: close(sss_cli_sd) should also be protected with mutex. Otherwise a thread calling pam_end() can close socket mid pam transaction in another thread. * a40b25a3af29706c058ce5a02dd0ba294dbb6874 - SSS_CLIENT: if poll() returns POLLNVAL then socket is alredy closed (or wasn't open) so it shouldn't be closed again. Otherwise there is a risk to close "foreign" socket opened in another thread. * f3333b9dbeda33a9344b458accaa4ff372adb660 - SSS_CLIENT: fix error codes returned by common read/write/check helpers. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (sssd bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:2986 |