Bug 1631813
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | LDAP back end: if parsing an entry fails, the whole back end goes offline | | |
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Dave <dsimes> |
| Component: | sssd | Assignee: | Paweł Poławski <ppolawsk> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Jakub Vavra <jvavra> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.2 | CC: | aboscatt, atikhono, dsimes, ekeck, grajaiya, jhrozek, lslebodn, mkosek, mzidek, pbrezina, pkettman, thalman, tscherf |
| Target Milestone: | rc | Keywords: | Reopened, Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-21 20:30:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Dave
2018-09-21 15:38:18 UTC
Please provide logs and steps to reproduce. See https://docs.pagure.org/SSSD.sssd/users/troubleshooting.html and https://docs.pagure.org/SSSD.sssd/users/reporting_bugs.html

https://access.redhat.com/support/cases/#/case/02107258 has sssd debug logs & sosreport from the customer incident.

(In reply to Dave from comment #3)
> https://access.redhat.com/support/cases/#/case/02107258
>
> has sssd debug logs & sosreport from the customer incident

The latest debug logs in the case are from May. I briefly checked all three of them, but couldn't find any crash in the logs.

Could you point me to the exact tarball and ideally also the timestamp of the crash?

Since there was no reply to the needinfo in about a month and no new logs etc. in the case, I'm closing the bug as INSUFFICIENT_DATA. Please reopen if you can provide the information requested in comment #4.

(In reply to Jakub Hrozek from comment #4)
> (In reply to Dave from comment #3)
> > https://access.redhat.com/support/cases/#/case/02107258
> >
> > has sssd debug logs & sosreport from the customer incident
>
> The latest debug logs in the case are from May. I briefly checked all three
> of them, but couldn't find any crash in the logs.
>
> Could you point me to the exact tarball and ideally also the timestamp of
> the crash?

The case description shows the log file and timestamps:

[root@ssc-prd-ipa-099 ~]# tail -60 /var/log/sssd/sssd_is.cl.ssa.gov.log
(Fri May 25 08:44:03 2018) [sssd[be[is.cl.ssa.gov]]] [orderly_shutdown] (0x0010): SIGTERM: killing children

I grabbed this attachment (but I believe they all have the file):
sssd-debug-nsc-prd-ipa-099.is.cl.ssa.gov.02107258-20180525110543.tbz

The sssd_is.cl.ssa.gov.log file shows 4 of the crashes if you grep for the following term:
SIGTERM: killing children

Additionally, if you look on page 2 of the case in the private messages/discussion, there is some further detail/analysis of the crash, specifically these posts:

PRIVATE MESSAGE (ASSOCIATE) Heverley, Alan on May 29 2018 at 04:42 PM -04:00
PRIVATE MESSAGE (ASSOCIATE) Stephenson, Justin on May 29 2018 at 05:00 PM -04:00

(They are rather long log output, or I would have posted them here.)

(In reply to Dave from comment #6)
> (In reply to Jakub Hrozek from comment #4)
> > (In reply to Dave from comment #3)
> > > https://access.redhat.com/support/cases/#/case/02107258
> > >
> > > has sssd debug logs & sosreport from the customer incident
> >
> > The latest debug logs in the case are from May. I briefly checked all three
> > of them, but couldn't find any crash in the logs.
> >
> > Could you point me to the exact tarball and ideally also the timestamp of
> > the crash?
>
> The case description shows the log file and timestamps:
>
> [root@ssc-prd-ipa-099 ~]# tail -60 /var/log/sssd/sssd_is.cl.ssa.gov.log
> (Fri May 25 08:44:03 2018) [sssd[be[is.cl.ssa.gov]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
>
> I grabbed this attachment (but I believe they all have the file):
> sssd-debug-nsc-prd-ipa-099.is.cl.ssa.gov.02107258-20180525110543.tbz
>
> The sssd_is.cl.ssa.gov.log file shows 4 of the crashes if you grep for the following term:
> SIGTERM: killing children

This is not a crash but a graceful shutdown: SIGTERM is used as a signal from the main sssd process to any of the worker processes.
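As a generic illustration (not taken from the case itself), one way to check whether the sssd service was deliberately stopped or restarted around one of those timestamps is to look at the system journal; the time window below simply mirrors the log excerpt above:

  # Show sssd unit start/stop messages around the SIGTERM seen in the logs:
  journalctl -u sssd.service --since "2018-05-25 08:00" --until "2018-05-25 09:00"
  # Current state and most recent status messages for the service:
  systemctl status sssd.service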
If 'something' is not working for the customer, then these messages are more indicative of a problem:

(Fri May 25 11:00:04 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]

But with a bug report as vague as this, I don't know how to help further, sorry. It would be best to start with what the problem is that the customer is seeing, and document the problem with configuration and log files.

(In reply to Jakub Hrozek from comment #9)
> if 'something' is not working for the customer, then these messages are more
> indicative of a problem:
>
> (Fri May 25 11:00:04 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data
> Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
>
> But with a bug report as vague as this, I don't know how to help further,
> sorry. It would be best to start with what the problem is that the customer is
> seeing, and document the problem with configuration and log files.

The Red Hat customer was asking about the constant sssd_be crashes/restarts (as mentioned in the case) that were not seen in their other production env (under 7.4; this new/problematic env was under 7.5). The issue was seen while attempting to troubleshoot Kerberos logins (also in the case). Sorry I keep referring to the case, there is just so much info/detail there that it is a lot to attempt to re-paste all of it here (the May 29 timeframe is a good source for all this detail/answers).

> This is not a crash but a graceful shutdown: SIGTERM is used as a signal from the main sssd process to any of the worker processes.

So why is sssd constantly restarting sssd_be?

But as mentioned in the case description, this bug had boiled down to the following (not sure if this helps point closer to the issue; the customer pointed this out: it did not happen in existing envs on 7.4, so why does 7.5 have processes that keep crashing under a new/clean 7.5 install?):

Description of problem:
sssd_be periodically crashing on 7.5 (IdM w/AD trust), did not occur on 7.4 (same IdM w/AD trust configuration)

Actual results:
periodic crashes from sssd_be

Expected results:
sssd_be does not crash

Additional info:
We later found out that by adding these settings in sssd.conf, sssd_be would no longer crash (which is explained in the SSSD performance tuning doc), but the customer did not have performance issues under 7.5, so these performance settings were never considered:

subdomain_inherit = ignore_group_members
ignore_group_members = True

I suppose we could have the customer bring up a 7.6 env (or upgrade the existing 7.5 to 7.6?) and see if the issue still occurs? If there is no issue under 7.6, I would think we could mark it as a 'resolved in a newer version' kind of thing.
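For reference, a minimal sketch of where those two tuning options would live; the sssd.conf section name is taken from the domain seen in the logs above, and everything else about the file is an assumption:

  [domain/is.cl.ssa.gov]
  # Do not download/resolve the member list of groups (performance tuning).
  ignore_group_members = True
  # Inherit the same setting into the trusted AD subdomains.
  subdomain_inherit = ignore_group_members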
(In reply to Dave from comment #10)
> (In reply to Jakub Hrozek from comment #9)
> > if 'something' is not working for the customer, then these messages are more
> > indicative of a problem:
> >
> > (Fri May 25 11:00:04 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data
> > Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
> >
> > But with a bug report as vague as this, I don't know how to help further,
> > sorry. It would be best to start with what the problem is that the customer is
> > seeing, and document the problem with configuration and log files.
>
> The Red Hat customer was asking about the constant sssd_be crashes/restarts
> (as mentioned in the case) that were not seen in their other production env
> (under 7.4; this new/problematic env was under 7.5). The issue was seen
> while attempting to troubleshoot Kerberos logins (also in the case). Sorry I
> keep referring to the case, there is just so much info/detail there that it is
> a lot to attempt to re-paste all of it here (the May 29 timeframe is a good
> source for all this detail/answers).

Perhaps, but not all (not many?) engineers have access to customer cases, so unless information is included in a bug report, the bug report is only useful to a handful of engineers.

The only useful comment from the May-29 timeframe is this one:

(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_asq_search_parse_entry] (0x2000): DN [CN=S137096,OU=DTSS,OU=OTSO,OU=DCS,OU=LAN Configs,DC=ba,DC=ad,DC=ssa,DC=gov] did not match the objectClass [group]
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_process_result] (0x2000): Trace: sh[0x55e9138841c0], connected[1], ops[0x55e9150b8780], ldap[0x55e9137eb9c0]
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_process_message] (0x4000): Message type: [LDAP_RES_SEARCH_ENTRY]
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_asq_search_parse_entry] (0x0040): Unknown entry type, no objectClass found for DN [CN=S36503E,OU=OAC,OU=OIS,OU=DCS,OU=LAN Configs,DC=ba,DC=ad,DC=ssa,DC=gov]!
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_get_generic_op_finished] (0x0020): reply parsing callback failed.
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_op_destructor] (0x1000): Abandoning operation 1341
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [generic_ext_search_handler] (0x0040): sdap_get_generic_ext_recv failed [22]: Invalid argument
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_deref_search_done] (0x0040): dereference processing failed [22]: Invalid argument
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_nested_group_deref_direct_done] (0x0020): Error processing direct membership [22]: Invalid argument
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_nested_done] (0x0020): Nested group processing failed: [22][Invalid argument]
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_id_op_destroy] (0x4000): releasing operation connection
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_id_op_done] (0x4000): releasing operation connection
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [be_mark_dom_offline] (0x1000): Marking subdomain ba.ad.ssa.gov offline
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [be_mark_subdom_offline] (0x1000): Marking subdomain ba.ad.ssa.gov as inactive
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [ipa_srv_ad_acct_lookup_done] (0x0040): ipa_get_*_acct request failed: [1432158271]: Subdomain is inactive.
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [ipa_subdomain_account_done] (0x0040): ipa_get_*_acct request failed: [1432158271]: Subdomain is inactive.

This actually sounds like an issue where, if an entry has no objectClass (or rather, I think the objectClass is not readable by the client), parsing the entry fails, and the error is propagated all the way up and marks the domain as 'inactive'. In the absence of at least an approximate timestamp of a failing login, it's hard to say if this is the root cause of the issue, but this is the only thing I could find in the sosreports.
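Purely as an illustration (not part of the original analysis), one way to verify whether the objectClass attribute of the affected AD entry is actually readable would be a base-scope search against that DN; the server URI and the GSSAPI bind below are assumptions:

  # Hypothetical check: is objectClass returned for this entry to our identity?
  ldapsearch -H ldap://ba.ad.ssa.gov -Y GSSAPI -s base \
      -b "CN=S36503E,OU=OAC,OU=OIS,OU=DCS,OU=LAN Configs,DC=ba,DC=ad,DC=ssa,DC=gov" \
      objectClass

If the entry comes back without any objectClass values, the directory ACLs are most likely hiding the attribute from that identity.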
Except for this, the sosreport sosreport-nsc-prd-ipa-099.is.cl.ssa.gov.02107258-20180529101400 does not have any useful sssd logs at all; there is only:

(Sun May 27 06:00:01 2018) [sssd[be[is.cl.ssa.gov]]] [ipa_id_get_account_info_orig_done] (0x0080): Object not found, ending request
(Sun May 27 06:00:01 2018) [sssd[be[is.cl.ssa.gov]]] [ipa_id_get_account_info_orig_done] (0x0080): Object not found, ending request
(Sun May 27 06:07:34 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_id_conn_data_expire_handler] (0x0080): connection is about to expire, releasing it
(Sun May 27 06:22:34 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_id_conn_data_expire_handler] (0x0080): connection is about to expire, releasing it

repeated over and over.

This tarball: sssd-debug-nsc-prd-ipa-099.is.cl.ssa.gov.02107258-2018052511054 covers a period of about 4 days, so the logs are gigabytes in size. Looking at errors, what stands out is what was in the case:

(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_asq_search_parse_entry] (0x2000): DN [CN=S2290EE,OU=SOC,OU=OIS,OU=DCS,OU=LAN Configs,DC=ba,DC=ad,DC=ssa,DC=gov] did not match the objectClass [group]
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_process_result] (0x2000): Trace: sh[0x55e91382d860], connected[1], ops[0x55e91ec1e7a0], ldap[0x55e91381e310]
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_process_message] (0x4000): Message type: [LDAP_RES_SEARCH_ENTRY]
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_asq_search_parse_entry] (0x0040): Unknown entry type, no objectClass found for DN [CN=S36503E,OU=OAC,OU=OIS,OU=DCS,OU=LAN Configs,DC=ba,DC=ad,DC=ssa,DC=gov]!
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_get_generic_op_finished] (0x0020): reply parsing callback failed.
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_op_destructor] (0x1000): Abandoning operation 1246
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [generic_ext_search_handler] (0x0040): sdap_get_generic_ext_recv failed [22]: Invalid argument
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_deref_search_done] (0x0040): dereference processing failed [22]: Invalid argument
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_nested_group_deref_direct_done] (0x0020): Error processing direct membership [22]: Invalid argument
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_nested_done] (0x0020): Nested group processing failed: [22][Invalid argument]
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_id_op_destroy] (0x4000): releasing operation connection
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_id_op_done] (0x4000): releasing operation connection
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [be_mark_dom_offline] (0x1000): Marking subdomain ba.ad.ssa.gov offline
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x55e9138fdcb0
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x55e913950600
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [ldb] (0x4000): Running timer event 0x55e9138fdcb0 "ltdb_callback"
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [ldb] (0x4000): Destroying timer event 0x55e913950600 "ltdb_timeout"
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [ldb] (0x4000): Ending timer event 0x55e9138fdcb0 "ltdb_callback"
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [be_mark_subdom_offline] (0x1000): Marking subdomain ba.ad.ssa.gov as inactive

The other thing that stands out is:

(Thu May 24 09:56:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_idmap_sid_to_unix] (0x0040): Object SID [S-1-5-21-2516199605-17133447-3333846373-13221604] has a RID that is larger than the ldap_idmap_range_size. See the "ID MAPPING" section of sssd-ad(5) for an explanation of how to resolve this issue.
(Thu May 24 09:56:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_idmap_sid_to_unix] (0x0080): Could not convert objectSID [S-1-5-21-2516199605-17133447-3333846373-13221604] to a UNIX ID
(Thu May 24 09:56:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_save_group] (0x0080): Could not convert SID string: [Input/output error]
(Thu May 24 09:56:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_save_group] (0x0080): Failed to save group [NY.Pinview.ssa.gov]: [Input/output error]
(Thu May 24 09:56:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_save_groups] (0x0040): Failed to store group 0. Ignoring.

This is pretty much what the error message says: if this is an IPA-AD trust environment, then the ranges are too small and need to be extended.
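To illustrate the arithmetic (the concrete value below is an assumption, not from the case): the RID is the last component of the SID, 13221604 here, and it has to fit inside a single ID-mapping range, so with the default ldap_idmap_range_size of 200000 that SID cannot be mapped; the range size would have to exceed the highest RID in the domain, roughly along these lines:

  # sssd.conf sketch, following the sssd-ad(5) option the error message points at.
  # NOTE: changing the range size changes already-computed UIDs/GIDs, and in an
  # IPA-AD trust the range is normally adjusted on the IPA side (ipa idrange)
  # rather than in sssd.conf, so treat this only as an illustration.
  ldap_idmap_range_size = 16777216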
So the parsing issue is a bug we should fix, and I filed an upstream tracker:
https://pagure.io/SSSD/sssd/issue/3908

> > This is not a crash but a graceful shutdown: SIGTERM is used as a signal from the main sssd process to any of the worker processes.
>
> So why is sssd constantly restarting sssd_be?

Because sssd itself received SIGTERM, according to:

(Mon May 21 09:39:31 2018) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
(Mon May 21 09:39:31 2018) [sssd] [monitor_quit_signal] (0x0040): Monitor received Terminated: terminating children
(Mon May 21 09:39:31 2018) [sssd] [monitor_quit] (0x0040): Returned with: 0

This really looks like someone restarting the sssd service on purpose. That's what I meant about graceful shutdown.

(In reply to Dave from comment #11)
> But as mentioned in the case description, this bug had boiled down to the following
> (not sure if this helps point closer to the issue; the customer pointed
> this out: it did not happen in existing envs on 7.4, so why does 7.5 have
> processes that keep crashing under a new/clean 7.5 install?):
>
> Description of problem:
> sssd_be periodically crashing on 7.5 (IdM w/AD trust), did not occur on 7.4
> (same IdM w/AD trust configuration)
>
> Actual results:
> periodic crashes from sssd_be
>
> Expected results:
> sssd_be does not crash
>
> Additional info:
> We later found out that by adding these settings in sssd.conf, sssd_be
> would no longer crash (which is explained in the SSSD performance tuning
> doc), but the customer did not have performance issues under 7.5, so these
> performance settings were never considered:
> subdomain_inherit = ignore_group_members
> ignore_group_members = True

This and the parsing issue lead me to believe that there really is no crash, but a combination of timeouts mitigated by the tuning (a timeout would also mark the domain as 'inactive') and the earlier parsing issue. Both would mark the domain as inactive, and unless there are pre-cached credentials, logins would subsequently fail.

> I suppose we could have the customer bring up a 7.6 env (or upgrade the existing
> 7.5 to 7.6?) and see if the issue still occurs? If there is no issue under
> 7.6, I would think we could mark it as a 'resolved in a newer version' kind of thing.

Yes, it would be very nice to reproduce the issue and provide fresh logs. We could also see if the parsing issue is still there, and if so, we could provide a patch, I guess.

Upstream is still open:
https://pagure.io/SSSD/sssd/issue/3908

The customer is tracking this bug via a support case.

Updated ticket to reflect the following:

(In reply to Dave from comment #14)
> upstream is still open:
> https://pagure.io/SSSD/sssd/issue/3908
>
> customer is tracking this bug via a support case

I have evaluated this bug multiple times with no positive results. The amount of logs provided in this BZ is very limited. Because of this, I decided to close this bug for now. If you disagree with the decision, please reopen it, or open a new support case and create a new BZ.

A new upstream PR addressing this bug has been submitted:
https://github.com/SSSD/sssd/pull/5881