Bug 2072050

Summary: sssd_nss exiting (due to missing 'sssd' local user) causes the SSSD service to restart in a loop
Product: Red Hat Enterprise Linux 8
Reporter: Micah Abbott <miabbott>
Component: sssd
Assignee: Alexey Tikhonov <atikhono>
Status: CLOSED ERRATA
QA Contact: shridhar <sgadekar>
Severity: high
Docs Contact:
Priority: high
Version: 8.6
CC: aboscatt, atikhono, bleanhar, dgoodwin, grajaiya, hhei, jdelft, jhrozek, jlebon, lslebodn, lwan, miabbott, mzidek, pbrezina, rpittau, sgadekar, toneata, travier, tscherf
Target Milestone: rc
Keywords: Triaged, ZStream
Target Release: ---
Flags: sgadekar: needinfo-
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: sync-to-jira
Fixed In Version: sssd-2.7.0-2.el8
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2074648
Environment:
Last Closed: 2022-11-08 10:51:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2074648    
Attachments:
Description                Flags
sssd.log                   none
sssd_nss.log               none
sssd_implicit_files.log    none

Description Micah Abbott 2022-04-05 13:48:49 UTC
Created attachment 1870862
sssd.log

As part of OpenShift 4.11, RHCOS recently switched to using the RHEL 8.6 Beta content and we started to see evidence of `sssd` crashing and restarting in a loop.

```
Apr 04 22:10:22 test1-4nhln-master-0 systemd[1]: sssd.service: Service RestartSec=100ms expired, scheduling restart.
Apr 04 22:10:22 test1-4nhln-master-0 systemd[1]: sssd.service: Scheduled restart job, restart counter is at 26.
Apr 04 22:10:22 test1-4nhln-master-0 systemd[1]: Stopped System Security Services Daemon.
Apr 04 22:10:22 test1-4nhln-master-0 systemd[1]: sssd.service: Consumed 532ms CPU time
Apr 04 22:10:22 test1-4nhln-master-0 systemd[1]: Starting System Security Services Daemon...
Apr 04 22:10:22 test1-4nhln-master-0 sssd[2615]: Starting up
Apr 04 22:10:22 test1-4nhln-master-0 sssd_be[2616]: Starting up
Apr 04 22:10:22 test1-4nhln-master-0 sssd_nss[2617]: Starting up
Apr 04 22:10:22 test1-4nhln-master-0 sssd_nss[2624]: Starting up
Apr 04 22:10:25 test1-4nhln-master-0 sssd_nss[2638]: Starting up
Apr 04 22:10:29 test1-4nhln-master-0 sssd_nss[2681]: Starting up
Apr 04 22:10:29 test1-4nhln-master-0 sssd[2615]: Exiting the SSSD. Could not restart critical service [nss].
Apr 04 22:10:29 test1-4nhln-master-0 systemd[1]: sssd.service: Main process exited, code=exited, status=1/FAILURE
Apr 04 22:10:29 test1-4nhln-master-0 systemd[1]: sssd.service: Failed with result 'exit-code'.
Apr 04 22:10:29 test1-4nhln-master-0 systemd[1]: Failed to start System Security Services Daemon.
Apr 04 22:10:29 test1-4nhln-master-0 systemd[1]: sssd.service: Consumed 510ms CPU time
Apr 04 22:10:29 test1-4nhln-master-0 systemd[1]: sssd.service: Service RestartSec=100ms expired, scheduling restart.
Apr 04 22:10:29 test1-4nhln-master-0 systemd[1]: sssd.service: Scheduled restart job, restart counter is at 27.
Apr 04 22:10:29 test1-4nhln-master-0 systemd[1]: Stopped System Security Services Daemon.
Apr 04 22:10:29 test1-4nhln-master-0 systemd[1]: sssd.service: Consumed 510ms CPU time
Apr 04 22:10:29 test1-4nhln-master-0 systemd[1]: Starting System Security Services Daemon...
Apr 04 22:10:29 test1-4nhln-master-0 sssd[2683]: Starting up
Apr 04 22:10:29 test1-4nhln-master-0 sssd_be[2684]: Starting up
Apr 04 22:10:29 test1-4nhln-master-0 sssd_nss[2685]: Starting up
Apr 04 22:10:29 test1-4nhln-master-0 sssd_nss[2686]: Starting up
```
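
The restart loop in the journal excerpt above can be quantified by pulling out the counter that systemd prints. A minimal sketch, assuming journal text in the format shown; `max_restart_counter` is a hypothetical helper name, not part of systemd or SSSD:

```shell
# Hypothetical helper (not part of systemd or SSSD).
# Reads journal text on stdin and prints the highest restart counter
# that systemd logged for the unit named in $1; prints nothing if none.
max_restart_counter() {
    grep -F "$1: Scheduled restart job, restart counter is at" |
        sed -E 's/.*restart counter is at ([0-9]+).*/\1/' |
        sort -n |
        tail -n 1
}
```

For example, `journalctl -u sssd.service -b --no-pager | max_restart_counter sssd.service` should print the highest counter logged during the current boot; a value that keeps growing between runs suggests the loop is still active.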

The version of `sssd` used in RHCOS is `sssd-0-2.6.2-3.el8-x86_64`


We think this may be related to:

https://bugzilla.redhat.com/show_bug.cgi?id=1796466#c10
https://github.com/SSSD/sssd/issues/5753


The upstream PR:

https://github.com/SSSD/sssd/pull/6075

...may resolve this issue for us.


This currently blocks OpenShift clusters from being installed or started successfully.

Comment 1 Micah Abbott 2022-04-05 13:50:10 UTC
Created attachment 1870873
sssd_nss.log

Comment 2 Alexey Tikhonov 2022-04-05 14:13:41 UTC
(In reply to Micah Abbott from comment #0)
> 
> https://github.com/SSSD/sssd/pull/6075
> 
> ...may resolve this issue for us.

This may "hide" the issue, but it will not resolve it.

Do you have a custom sssd.conf? Could you please share the contents of /etc/sssd/* and also /var/log/sssd.log?

Comment 6 Micah Abbott 2022-04-05 14:53:05 UTC
This problem is being reproduced in an ephemeral CI cluster, so access to the nodes is difficult as they are torn down after failure.

However, on a single RHCOS node (outside of the cluster), the contents of `/etc/sssd/` are:


```
$ sudo ls -lR /etc/sssd/
/etc/sssd/:
total 0
drwx--x--x. 2 sssd sssd 6 Apr  5 14:31 conf.d
drwx--x--x. 2 root root 6 Apr  5 14:31 pki

/etc/sssd/conf.d:
total 0

/etc/sssd/pki:
total 0
```

Comment 7 Alexey Tikhonov 2022-04-05 16:09:37 UTC
> [sss_user_by_name_or_uid] (0x0040): [sssd] is neither a valid UID nor a user name which could be resolved by getpwnam()

This happens in `nss_process_init()`->`sssd_supplementary_group()`->`sss_user_by_name_or_uid(SSSD_USER)`:
https://github.com/SSSD/sssd/blob/d1bce130f590e7e81a8472b8c9804ebe63898852/src/responder/nss/nsssrv.c#L404

For RHEL 8, SSSD is configured with `--with-sssd-user=sssd`.

The `%pre` sections for `ipa`, `krb5-common`, `common`, and `proxy` in the RHEL 8 spec file create this local user.
But it seems this user is missing on your host. Could you please confirm this?
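
One way to confirm this on a host is to compare what NSS resolves with what `/etc/passwd` lists. A rough sketch; `has_passwd_entry` and `report_user` are hypothetical helper names, and `getent passwd` is used here because it goes through the host's NSS configuration:

```shell
# Hypothetical helper: succeed if user $2 has an entry in passwd-format file $1.
has_passwd_entry() {
    awk -F: -v name="$2" '$1 == name { found = 1 } END { exit !found }' "$1"
}

# Hypothetical helper: report how user $1 resolves on this host.
# $2 optionally overrides the passwd file to inspect (default: /etc/passwd).
report_user() {
    if getent passwd "$1" >/dev/null; then
        echo "$1: resolvable through NSS"
    else
        echo "$1: not resolvable through NSS"
    fi
    if has_passwd_entry "${2:-/etc/passwd}" "$1"; then
        echo "$1: listed in ${2:-/etc/passwd}"
    else
        echo "$1: not listed in ${2:-/etc/passwd}"
    fi
}
```

Running `report_user sssd` prints two lines, one per check, which makes it easy to spot a user that NSS can resolve but that is absent from `/etc/passwd`.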

What package is used by RHCOS? Is it the same as for RHEL? How is it installed?

Comment 8 Micah Abbott 2022-04-05 16:41:20 UTC
That would explain it.

```
$ cat /etc/passwd
root:x:0:0:root:/root:/bin/bash
core:x:1000:1000:CoreOS Admin:/var/home/core:/bin/bash
containers:x:1001:995:User for housing the sub ID range for containers:/var/home/containers:/sbin/nologin
```

We are including sssd-2.6.2-3.el8.x86_64 as part of RHCOS.

But the RPM scriptlets run on the server side during the `rpm-ostree compose` step of the build process, so that might be where the disconnect is.

We had not encountered this issue with older versions of `sssd`. The problem showed up when we moved from RHEL 8.4 EUS to RHEL 8.6 Beta.

Specifically, `sssd 2.5.2-2.el8_5.4 -> 2.6.2-3.el8`

Let me reach out to more folks in the CoreOS team about this.

Comment 9 Jonathan Lebon 2022-04-05 17:03:37 UTC
RHCOS uses nss-altfiles to separate out system users into /usr/lib from local users in /etc. Looking at an RHCOS 8.6 pipeline build, we do have the sssd user and group:

[root@cosa-devsh ~]# grep sssd /usr/lib/passwd
sssd:x:995:993:User for sssd:/:/sbin/nologin
[root@cosa-devsh ~]# grep sssd /usr/lib/group
sssd:x:993:
[root@cosa-devsh ~]# getent passwd sssd
sssd:x:995:993:User for sssd:/:/sbin/nologin
[root@cosa-devsh ~]# getent group sssd
sssd:x:993:
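
With nss-altfiles, a user can be defined in either file, so it helps to see which file actually carries the entry. A minimal sketch; `passwd_sources` is a hypothetical helper, and the list of files is passed in by the caller rather than assumed:

```shell
# Hypothetical helper: print each passwd-format file (arguments after the
# first) that contains an entry for user $1. Unreadable files are skipped.
passwd_sources() {
    user=$1
    shift
    for file in "$@"; do
        [ -r "$file" ] || continue
        if awk -F: -v name="$user" '$1 == name { found = 1 } END { exit !found }' "$file"; then
            echo "$file"
        fi
    done
}
```

On the build shown above, `passwd_sources sssd /etc/passwd /usr/lib/passwd` would be expected to print only `/usr/lib/passwd`, assuming `/etc/passwd` there matches the one posted in comment 8.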

Comment 11 Alexey Tikhonov 2022-04-05 17:04:33 UTC
Btw, note that if the only SSSD domain backend running is 'implicit_files' (`sssd_be --domain implicit_files`), then one option could be to just disable SSSD by default (https://access.redhat.com/solutions/6815101), leaving the option to configure it explicitly if network identities are needed on the node.

Comment 12 Alexey Tikhonov 2022-04-05 17:30:23 UTC
(In reply to Jonathan Lebon from comment #9)
> RHCOS uses nss-altfiles to separate out system users into /usr/lib from
> local users in /etc. Looking at an RHCOS 8.6 pipeline build, we do have the
> sssd user and group:
> 
> [root@cosa-devsh ~]# grep sssd /usr/lib/passwd
> sssd:x:995:993:User for sssd:/:/sbin/nologin


Ah, this explains:

(In reply to Micah Abbott from comment #8)
> 
> We previously haven't encountered this issue with older versions of `sssd`. 
> We moved from RHEL 8.4 EUS to RHEL 8.6 Beta when this problem showed up.
> 
> Specifically, `sssd 2.5.2-2.el8_5.4 -> 2.6.2-3.el8`


The reason is https://github.com/SSSD/sssd/pull/5867  --  it was released upstream in sssd-2.6.2
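
To check whether a host carries an sssd version at or above the one that introduced this change, the upstream version can be compared against 2.6.2. A minimal sketch; `version_ge` is a hypothetical helper, it relies on `sort -V` (version sort, as provided by GNU coreutils), and it compares only the upstream version string, not the RPM release:

```shell
# Hypothetical helper: succeed if dotted version $1 is greater than or
# equal to $2. Assumes a `sort` that supports -V (version sort).
version_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}
```

For example, `version_ge "$(rpm -q --qf '%{VERSION}' sssd)" 2.6.2 && echo 'sssd is 2.6.2 or newer'`. The Fixed In Version field of this bug lists sssd-2.7.0-2.el8, so a match only narrows things down; it does not by itself prove the host is affected.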

Comment 21 Timothée Ravier 2022-04-07 16:03:42 UTC
AFAIU, upstream did not spot this one due to this change: https://fedoraproject.org/wiki/Changes/FlexibleLocalUserCache

Comment 25 Alexey Tikhonov 2022-04-12 14:44:13 UTC
Upstream PR: https://github.com/SSSD/sssd/pull/6108

Comment 28 Alexey Tikhonov 2022-04-14 09:39:40 UTC
Pushed PR: https://github.com/SSSD/sssd/pull/6108

* `master`
    * 3c6218aa91026e066e793ee26333ea64fd6bc50e - Revert "man: sssd.conf and sssd-ifp clarify user option"
    * 37f90057792a0b4543f34684ed9a240fe8e869c1 - Revert "usertools: force local user for sssd process user"

Comment 33 HuijingHei 2022-05-30 06:38:51 UTC
Maybe it is missing the Verified:Tested flag.
@shridhar, could you help test the new build? I can also help test from my side.

Comment 42 errata-xmlrpc 2022-11-08 10:51:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sssd bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7739