Bug 1631813
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | LDAP back end: if parsing an entry fails, the whole back end goes offline | | |
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Dave <dsimes> |
| Component: | sssd | Assignee: | Paweł Poławski <ppolawsk> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Jakub Vavra <jvavra> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.2 | CC: | aboscatt, atikhono, dsimes, ekeck, grajaiya, jhrozek, lslebodn, mkosek, mzidek, pbrezina, pkettman, thalman, tscherf |
| Target Milestone: | rc | Keywords: | Reopened, Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-21 20:30:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Dave
2018-09-21 15:38:18 UTC
Please provide logs and steps to reproduce. See https://docs.pagure.org/SSSD.sssd/users/troubleshooting.html and https://docs.pagure.org/SSSD.sssd/users/reporting_bugs.html

https://access.redhat.com/support/cases/#/case/02107258 has sssd debug logs & sosreport from the customer incident.

(In reply to Dave from comment #3)
> https://access.redhat.com/support/cases/#/case/02107258
>
> has sssd debug logs & sosreport from the customer incident

The latest debug logs in the case are from May. I briefly checked all three of them, but couldn't find any crash in the logs.

Could you point me to the exact tarball and ideally also the timestamp of the crash?

Since there was no reply to the needinfo in about a month and no new logs etc. in the case, I'm closing the bug as INSUFFICIENT_DATA. Please reopen if you can provide the information requested in comment #4.

(In reply to Jakub Hrozek from comment #4)
> (In reply to Dave from comment #3)
> > https://access.redhat.com/support/cases/#/case/02107258
> >
> > has sssd debug logs & sosreport from the customer incident
>
> The latest debug logs in the case are from May. I briefly checked all three
> of them, but couldn't find any crash in the logs.
>
> Could you point me to the exact tarball and ideally also the timestamp of
> the crash?

The case description shows the log file and timestamps:

[root@ssc-prd-ipa-099 ~]# tail -60 /var/log/sssd/sssd_is.cl.ssa.gov.log
(Fri May 25 08:44:03 2018) [sssd[be[is.cl.ssa.gov]]] [orderly_shutdown] (0x0010): SIGTERM: killing children

I grabbed this attachment (but I believe they all have the file):
sssd-debug-nsc-prd-ipa-099.is.cl.ssa.gov.02107258-20180525110543.tbz

The sssd_is.cl.ssa.gov.log file shows 4 of the crashes if you grep for the following term:
SIGTERM: killing children

Additionally, if you look on page 2 of the case in the private messages/discussion, there is some further detail/analysis of the crash, specifically these posts:

PRIVATE MESSAGE (ASSOCIATE) Heverley, Alan on May 29 2018 at 04:42 PM -04:00
PRIVATE MESSAGE (ASSOCIATE) Stephenson, Justin on May 29 2018 at 05:00 PM -04:00

(They are rather long log output, or I would have posted them here.)

(In reply to Dave from comment #6)
> (In reply to Jakub Hrozek from comment #4)
> > (In reply to Dave from comment #3)
> > > https://access.redhat.com/support/cases/#/case/02107258
> > >
> > > has sssd debug logs & sosreport from the customer incident
> >
> > The latest debug logs in the case are from May. I briefly checked all three
> > of them, but couldn't find any crash in the logs.
> >
> > Could you point me to the exact tarball and ideally also the timestamp of
> > the crash?
>
> The case description shows the log file and timestamps:
>
> [root@ssc-prd-ipa-099 ~]# tail -60 /var/log/sssd/sssd_is.cl.ssa.gov.log
> (Fri May 25 08:44:03 2018) [sssd[be[is.cl.ssa.gov]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
>
> I grabbed this attachment (but I believe they all have the file):
> sssd-debug-nsc-prd-ipa-099.is.cl.ssa.gov.02107258-20180525110543.tbz
>
> The sssd_is.cl.ssa.gov.log file shows 4 of the crashes if you grep for the following term:
> SIGTERM: killing children

This is not a crash but a graceful shutdown: SIGTERM is used as a signal from the main sssd process to any of the worker processes.
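As a generic illustration (not taken from the case itself), one way to check whether the sssd service was deliberately stopped or restarted around one of those timestamps is to look at the system journal; the time window below simply mirrors the log excerpt above:

  # Show sssd unit start/stop messages around the SIGTERM seen in the logs:
  journalctl -u sssd.service --since "2018-05-25 08:00" --until "2018-05-25 09:00"
  # Current state and most recent status messages for the service:
  systemctl status sssd.service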
If 'something' is not working for the customer, then these messages are more indicative of a problem:

(Fri May 25 11:00:04 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]

But with a bug report as vague as this, I don't know how to help further, sorry. It would be best to start with what the problem is that the customer is seeing, and document the problem with configuration and log files.

(In reply to Jakub Hrozek from comment #9)
> if 'something' is not working for the customer, then these messages are more
> indicative of a problem:
>
> (Fri May 25 11:00:04 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data
> Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
>
> But with a bug report as vague as this, I don't know how to help further,
> sorry. It would be best to start with what the problem is that the customer is
> seeing, and document the problem with configuration and log files.

The Red Hat customer was asking about the constant sssd_be crashes/restarts (as mentioned in the case) that were not seen in their other production env (under 7.4; this new/problematic env was under 7.5). The issue was seen while attempting to troubleshoot Kerberos logins (also in the case). Sorry I keep referring to the case, there is just so much info/detail there that it is a lot to attempt to re-paste all of it here (the May 29 timeframe is a good source for all this detail/answers).

> This is not a crash but a graceful shutdown: SIGTERM is used as a signal from the main sssd process to any of the worker processes.

So why is sssd constantly restarting sssd_be?

But as mentioned in the case description, this bug had boiled down to the following (not sure if this helps point closer to the issue; the customer pointed this out: it did not happen in existing envs on 7.4, so why does 7.5 have processes that keep crashing under a new/clean 7.5 install?):

Description of problem:
sssd_be periodically crashing on 7.5 (IdM w/AD trust), did not occur on 7.4 (same IdM w/AD trust configuration)

Actual results:
periodic crashes from sssd_be

Expected results:
sssd_be does not crash

Additional info:
We later found out that by adding these settings in sssd.conf, sssd_be would no longer crash (which is explained in the SSSD performance tuning doc), but the customer did not have performance issues under 7.5, so these performance settings were never considered:

subdomain_inherit = ignore_group_members
ignore_group_members = True

I suppose we could have the customer bring up a 7.6 env (or upgrade the existing 7.5 to 7.6?) and see if the issue still occurs? If there is no issue under 7.6, I would think we could mark it as a 'resolved in a newer version' kind of thing.
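For reference, a minimal sketch of where those two tuning options would live; the sssd.conf section name is taken from the domain seen in the logs above, and everything else about the file is an assumption:

  [domain/is.cl.ssa.gov]
  # Do not download/resolve the member list of groups (performance tuning).
  ignore_group_members = True
  # Inherit the same setting into the trusted AD subdomains.
  subdomain_inherit = ignore_group_members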
(In reply to Dave from comment #10)
> (In reply to Jakub Hrozek from comment #9)
> > if 'something' is not working for the customer, then these messages are more
> > indicative of a problem:
> >
> > (Fri May 25 11:00:04 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data
> > Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
> >
> > But with a bug report as vague as this, I don't know how to help further,
> > sorry. It would be best to start with what the problem is that the customer is
> > seeing, and document the problem with configuration and log files.
>
> The Red Hat customer was asking about the constant sssd_be crashes/restarts
> (as mentioned in the case) that were not seen in their other production env
> (under 7.4; this new/problematic env was under 7.5). The issue was seen
> while attempting to troubleshoot Kerberos logins (also in the case). Sorry I
> keep referring to the case, there is just so much info/detail there that it is
> a lot to attempt to re-paste all of it here (the May 29 timeframe is a good
> source for all this detail/answers).

Perhaps, but not all (not many?) engineers have access to customer cases, so unless information is included in a bug report, the bug report is only useful to a handful of engineers.

The only useful comment from the May-29 timeframe is this one:

(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_asq_search_parse_entry] (0x2000): DN [CN=S137096,OU=DTSS,OU=OTSO,OU=DCS,OU=LAN Configs,DC=ba,DC=ad,DC=ssa,DC=gov] did not match the objectClass [group]
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_process_result] (0x2000): Trace: sh[0x55e9138841c0], connected[1], ops[0x55e9150b8780], ldap[0x55e9137eb9c0]
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_process_message] (0x4000): Message type: [LDAP_RES_SEARCH_ENTRY]
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_asq_search_parse_entry] (0x0040): Unknown entry type, no objectClass found for DN [CN=S36503E,OU=OAC,OU=OIS,OU=DCS,OU=LAN Configs,DC=ba,DC=ad,DC=ssa,DC=gov]!
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_get_generic_op_finished] (0x0020): reply parsing callback failed.
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_op_destructor] (0x1000): Abandoning operation 1341
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [generic_ext_search_handler] (0x0040): sdap_get_generic_ext_recv failed [22]: Invalid argument
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_deref_search_done] (0x0040): dereference processing failed [22]: Invalid argument
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_nested_group_deref_direct_done] (0x0020): Error processing direct membership [22]: Invalid argument
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_nested_done] (0x0020): Nested group processing failed: [22][Invalid argument]
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_id_op_destroy] (0x4000): releasing operation connection
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_id_op_done] (0x4000): releasing operation connection
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [be_mark_dom_offline] (0x1000): Marking subdomain ba.ad.ssa.gov offline
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [be_mark_subdom_offline] (0x1000): Marking subdomain ba.ad.ssa.gov as inactive
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [ipa_srv_ad_acct_lookup_done] (0x0040): ipa_get_*_acct request failed: [1432158271]: Subdomain is inactive.
(Mon May 21 09:42:32 2018) [sssd[be[is.cl.ssa.gov]]] [ipa_subdomain_account_done] (0x0040): ipa_get_*_acct request failed: [1432158271]: Subdomain is inactive.

This actually sounds like an issue where, if an entry has no objectClass (or rather, I think the objectClass is not readable by the client), parsing the entry fails, and the error is propagated all the way up and marks the domain as 'inactive'. In the absence of at least an approximate timestamp of a failing login, it's hard to say if this is the root cause of the issue, but this is the only thing I could find in the sosreports.
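Purely as an illustration (not part of the original analysis), one way to verify whether the objectClass attribute of the affected AD entry is actually readable would be a base-scope search against that DN; the server URI and the GSSAPI bind below are assumptions:

  # Hypothetical check: is objectClass returned for this entry to our identity?
  ldapsearch -H ldap://ba.ad.ssa.gov -Y GSSAPI -s base \
      -b "CN=S36503E,OU=OAC,OU=OIS,OU=DCS,OU=LAN Configs,DC=ba,DC=ad,DC=ssa,DC=gov" \
      objectClass

If the entry comes back without any objectClass values, the directory ACLs are most likely hiding the attribute from that identity.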
Except for this, the sosreport sosreport-nsc-prd-ipa-099.is.cl.ssa.gov.02107258-20180529101400 does not have any useful sssd logs at all; there is only:

(Sun May 27 06:00:01 2018) [sssd[be[is.cl.ssa.gov]]] [ipa_id_get_account_info_orig_done] (0x0080): Object not found, ending request
(Sun May 27 06:00:01 2018) [sssd[be[is.cl.ssa.gov]]] [ipa_id_get_account_info_orig_done] (0x0080): Object not found, ending request
(Sun May 27 06:07:34 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_id_conn_data_expire_handler] (0x0080): connection is about to expire, releasing it
(Sun May 27 06:22:34 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_id_conn_data_expire_handler] (0x0080): connection is about to expire, releasing it

repeated over and over.

This tarball: sssd-debug-nsc-prd-ipa-099.is.cl.ssa.gov.02107258-2018052511054 covers a period of about 4 days, so the logs are gigabytes in size. Looking at errors, what stands out is what was in the case:

(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_asq_search_parse_entry] (0x2000): DN [CN=S2290EE,OU=SOC,OU=OIS,OU=DCS,OU=LAN Configs,DC=ba,DC=ad,DC=ssa,DC=gov] did not match the objectClass [group]
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_process_result] (0x2000): Trace: sh[0x55e91382d860], connected[1], ops[0x55e91ec1e7a0], ldap[0x55e91381e310]
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_process_message] (0x4000): Message type: [LDAP_RES_SEARCH_ENTRY]
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_asq_search_parse_entry] (0x0040): Unknown entry type, no objectClass found for DN [CN=S36503E,OU=OAC,OU=OIS,OU=DCS,OU=LAN Configs,DC=ba,DC=ad,DC=ssa,DC=gov]!
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_get_generic_op_finished] (0x0020): reply parsing callback failed.
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_op_destructor] (0x1000): Abandoning operation 1246
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [generic_ext_search_handler] (0x0040): sdap_get_generic_ext_recv failed [22]: Invalid argument
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_deref_search_done] (0x0040): dereference processing failed [22]: Invalid argument
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_nested_group_deref_direct_done] (0x0020): Error processing direct membership [22]: Invalid argument
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_nested_done] (0x0020): Nested group processing failed: [22][Invalid argument]
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_id_op_destroy] (0x4000): releasing operation connection
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_id_op_done] (0x4000): releasing operation connection
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [be_mark_dom_offline] (0x1000): Marking subdomain ba.ad.ssa.gov offline
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x55e9138fdcb0
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x55e913950600
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [ldb] (0x4000): Running timer event 0x55e9138fdcb0 "ltdb_callback"
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [ldb] (0x4000): Destroying timer event 0x55e913950600 "ltdb_timeout"
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [ldb] (0x4000): Ending timer event 0x55e9138fdcb0 "ltdb_callback"
(Wed May 23 16:06:43 2018) [sssd[be[is.cl.ssa.gov]]] [be_mark_subdom_offline] (0x1000): Marking subdomain ba.ad.ssa.gov as inactive

The other thing that stands out is:

(Thu May 24 09:56:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_idmap_sid_to_unix] (0x0040): Object SID [S-1-5-21-2516199605-17133447-3333846373-13221604] has a RID that is larger than the ldap_idmap_range_size. See the "ID MAPPING" section of sssd-ad(5) for an explanation of how to resolve this issue.
(Thu May 24 09:56:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_idmap_sid_to_unix] (0x0080): Could not convert objectSID [S-1-5-21-2516199605-17133447-3333846373-13221604] to a UNIX ID
(Thu May 24 09:56:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_save_group] (0x0080): Could not convert SID string: [Input/output error]
(Thu May 24 09:56:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_save_group] (0x0080): Failed to save group [NY.Pinview.ssa.gov]: [Input/output error]
(Thu May 24 09:56:32 2018) [sssd[be[is.cl.ssa.gov]]] [sdap_save_groups] (0x0040): Failed to store group 0. Ignoring.

This is pretty much what the error message says: if this is an IPA-AD trust environment, then the ranges are too small and need to be extended.
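To illustrate the arithmetic (the concrete value below is an assumption, not from the case): the RID is the last component of the SID, 13221604 here, and it has to fit inside a single ID-mapping range, so with the default ldap_idmap_range_size of 200000 that SID cannot be mapped; the range size would have to exceed the highest RID in the domain, roughly along these lines:

  # sssd.conf sketch, following the sssd-ad(5) option the error message points at.
  # NOTE: changing the range size changes already-computed UIDs/GIDs, and in an
  # IPA-AD trust the range is normally adjusted on the IPA side (ipa idrange)
  # rather than in sssd.conf, so treat this only as an illustration.
  ldap_idmap_range_size = 16777216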
So the parsing issue is a bug we should fix, and I filed an upstream tracker:
https://pagure.io/SSSD/sssd/issue/3908

> > This is not a crash but a graceful shutdown: SIGTERM is used as a signal from the main sssd process to any of the worker processes.
>
> So why is sssd constantly restarting sssd_be?

Because sssd itself received SIGTERM, according to:

(Mon May 21 09:39:31 2018) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
(Mon May 21 09:39:31 2018) [sssd] [monitor_quit_signal] (0x0040): Monitor received Terminated: terminating children
(Mon May 21 09:39:31 2018) [sssd] [monitor_quit] (0x0040): Returned with: 0

This really looks like someone restarting the sssd service on purpose. That's what I meant about graceful shutdown.

(In reply to Dave from comment #11)
> But as mentioned in the case description, this bug had boiled down to the following
> (not sure if this helps point closer to the issue; the customer pointed
> this out: it did not happen in existing envs on 7.4, so why does 7.5 have
> processes that keep crashing under a new/clean 7.5 install?):
>
> Description of problem:
> sssd_be periodically crashing on 7.5 (IdM w/AD trust), did not occur on 7.4
> (same IdM w/AD trust configuration)
>
> Actual results:
> periodic crashes from sssd_be
>
> Expected results:
> sssd_be does not crash
>
> Additional info:
> We later found out that by adding these settings in sssd.conf, sssd_be
> would no longer crash (which is explained in the SSSD performance tuning
> doc), but the customer did not have performance issues under 7.5, so these
> performance settings were never considered:
> subdomain_inherit = ignore_group_members
> ignore_group_members = True

This and the parsing issue lead me to believe that there really is no crash, but a combination of timeouts mitigated by the tuning (a timeout would also mark the domain as 'inactive') and the earlier parsing issue. Both would mark the domain as inactive, and unless there are pre-cached credentials, logins would subsequently fail.

> I suppose we could have the customer bring up a 7.6 env (or upgrade the existing
> 7.5 to 7.6?) and see if the issue still occurs? If there is no issue under
> 7.6, I would think we could mark it as a 'resolved in a newer version' kind of thing.

Yes, it would be very nice to reproduce the issue and provide fresh logs. We could also see if the parsing issue is still there, and if so, we could provide a patch, I guess.

Upstream is still open:
https://pagure.io/SSSD/sssd/issue/3908

The customer is tracking this bug via a support case.

Updated ticket to reflect the following:

(In reply to Dave from comment #14)
> upstream is still open:
> https://pagure.io/SSSD/sssd/issue/3908
>
> customer is tracking this bug via a support case

I have evaluated this bug multiple times with no positive results. The amount of logs provided in this BZ is very limited. Because of this, I decided to close this bug for now. If you disagree with the decision, please reopen it, or open a new support case and create a new BZ.

A new upstream PR addressing this bug has been submitted:
https://github.com/SSSD/sssd/pull/5881