1650685 – sssd nss fatal error initializing socket

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1650685 - sssd nss fatal error initializing socket

Summary: sssd nss fatal error initializing socket

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	sssd
Sub Component:
Version:	8.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	8.0
Assignee:	Sumit Bose
QA Contact:	sssd-qe
Docs Contact:
URL:
Whiteboard:
Depends On:	1682305
Blocks:
TreeView+	depends on / blocked

Reported:	2018-11-16 20:39 UTC by Scott Poore
Modified:	2020-11-14 10:54 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-07-04 07:36:23 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
sssd logs (448.21 KB, application/x-gzip) 2018-11-16 20:44 UTC, Scott Poore	no flags	Details
View All

Description Scott Poore 2018-11-16 20:39:35 UTC

Description of problem:

Seeing a failure when trying to restart sssd:

[root@rhel8-2 sssd]# systemctl start sssd
Job for sssd.service failed because the control process exited with error code.
See "systemctl status sssd.service" and "journalctl -xe" for details.


From journal:

Nov 16 14:18:38 rhel8-2.example.com sssd[nss][22931]: Starting up
Nov 16 14:18:38 rhel8-2.example.com sssd[22902]: Exiting the SSSD. Could not restart critical service >
Nov 16 14:18:38 rhel8-2.example.com sssd[pac][22910]: Shutting down
Nov 16 14:18:38 rhel8-2.example.com sssd[sudo][22909]: Shutting down
Nov 16 14:18:38 rhel8-2.example.com sssd[ssh][22908]: Shutting down
Nov 16 14:18:38 rhel8-2.example.com sssd[ifp][22907]: Shutting down
Nov 16 14:18:38 rhel8-2.example.com sssd[pam][22906]: Shutting down
Nov 16 14:18:38 rhel8-2.example.com sssd[be[example.com]][22904]: Shutting down
Nov 16 14:18:38 rhel8-2.example.com sssd[be[implicit_files]][22903]: Shutting down
Nov 16 14:18:38 rhel8-2.example.com systemd[1]: sssd.service: Main process exited, code=exited, status>
Nov 16 14:18:38 rhel8-2.example.com systemd[1]: sssd.service: Failed with result 'exit-code'.
Nov 16 14:18:38 rhel8-2.example.com systemd[1]: Failed to start System Security Services Daemon.

sssd_nss.log:

(Fri Nov 16 14:20:57 2018) [sssd[nss]] [create_pipe_fd] (0x0010): Unable to bind on socket '/var/lib/sss/pipes/nss'
(Fri Nov 16 14:20:57 2018) [sssd[nss]] [activate_unix_sockets] (0x0010): Fatal error initializing sockets
(Fri Nov 16 14:20:57 2018) [sssd[nss]] [sss_process_init] (0x0010): fatal error initializing socket
(Fri Nov 16 14:20:57 2018) [sssd[nss]] [sss_responder_ctx_destructor] (0x0400): Responder is being shut down
(Fri Nov 16 14:20:57 2018) [sssd[nss]] [nss_process_init] (0x0010): sss_process_init() failed


Version-Release number of selected component (if applicable):
sssd-2.0.0-23.el8.x86_64


How reproducible:
Unknown but has happened randomly a few times during testing

Steps to Reproduce:
1.  setup sssd via ipa-client
2.  systemctl stop sssd
3.  rm -rf /var/lib/sss/{db,mc}/*
4.  systemctl start sssd



Actual results:

failing as shown above.  


Expected results:

sssd starts cleanly.


Additional info:

Comment 1 Scott Poore 2018-11-16 20:44:37 UTC

Created attachment 1506550 [details]
sssd logs

Comment 2 Jakub Hrozek 2018-11-19 08:27:10 UTC

This is literally the bind(2) syscall failing, this debug message:
(Fri Nov 16 14:20:57 2018) [sssd[nss]] [create_pipe_fd] (0x0010): Unable to bind on socket '/var/lib/sss/pipes/nss'

corresponds to:
 848 
 849     if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
 850         DEBUG(SSSDBG_FATAL_FAILURE,
 851               "Unable to bind on socket '%s'\n", sock_name);
 852         ret = EIO;
 853         goto done;
 854     }

At the very least, we should print errno to know why bind failed.

btw Scott, did the system maybe use socket activation for the nss responder (systemctl status sssd-nss.socket) ?

Comment 3 Scott Poore 2018-11-19 19:21:50 UTC

I don't think so.  I just booted the VM back up and it's not running:

[root@rhel8-2 ~]# systemctl status sssd-nss.socket
● sssd-nss.socket - SSSD NSS Service responder socket
   Loaded: loaded (/usr/lib/systemd/system/sssd-nss.socket; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:sssd.conf(5)
   Listen: /var/lib/sss/pipes/nss (Stream)

I don't know if it's related but, I was seeing the failure several times so I added a 5 second sleep after cache cleanup rm and before sssd start.

Also, I manually ran the following one time:

sss_cache -E
sssctl cache-remove


After those steps, I reran my tests and did not see this issue again.

Comment 4 Jakub Hrozek 2018-11-20 11:32:49 UTC

(In reply to Scott Poore from comment #3)
> I don't think so.  I just booted the VM back up and it's not running:
> 

Is it a local VM or a VM in beaker? I'm asking if it was possible to poke around..

If not, is the reproducer as simple as 'restart sssd in a loop' ?

In the meantime, I opened https://github.com/SSSD/sssd/pull/697 which should at least help with debugging.

> [root@rhel8-2 ~]# systemctl status sssd-nss.socket
> ● sssd-nss.socket - SSSD NSS Service responder socket
>    Loaded: loaded (/usr/lib/systemd/system/sssd-nss.socket; disabled; vendor
> preset: disabled)
>    Active: inactive (dead)
>      Docs: man:sssd.conf(5)
>    Listen: /var/lib/sss/pipes/nss (Stream)
> 

OK, then the socket is disable and stopped, so it's not the socket activation causing issues.

> I don't know if it's related but, I was seeing the failure several times so
> I added a 5 second sleep after cache cleanup rm and before sssd start.
> 
> Also, I manually ran the following one time:
> 
> sss_cache -E
> sssctl cache-remove
> 
> 
> After those steps, I reran my tests and did not see this issue again.

Comment 5 Scott Poore 2018-11-20 13:51:55 UTC

Local VM.  I can set it up so you can login though if needed.   restarting sssd in a loop might reproduce the issue.  The test automation would have restarted it many times before I hit this issue.

Let me know if you want me to setup env for you to take a look

Comment 6 Sumit Bose 2019-04-12 09:12:11 UTC

Hi Scott.

can you still reproduce the issue? I tried to restart SSSD in a loop for a long time but wasn't able to see a startup failure.

bye,
Sumit

Comment 7 Scott Poore 2019-04-15 15:43:46 UTC

I have not seen the issue again.  Maybe corrupt cache somewhere else?   Or fluke or fixed as a side effect of another bugfix.

This is the version I am trying now.

sssd-2.0.0-43.el8.x86_64


So, if you don't have any ideas about what could have done it, I guess we can close this.   I am running a loop but, I'm not expecting it to find anything.  It's already run through 30+ restarts and no issues.

Comment 8 Sumit Bose 2019-04-15 16:21:37 UTC

(In reply to Scott Poore from comment #7)
> I have not seen the issue again.  Maybe corrupt cache somewhere else?   Or
> fluke or fixed as a side effect of another bugfix.
> 
> This is the version I am trying now.
> 
> sssd-2.0.0-43.el8.x86_64
> 
> 
> So, if you don't have any ideas about what could have done it, I guess we
> can close this.   I am running a loop but, I'm not expecting it to find
> anything.  It's already run through 30+ restarts and no issues.

Hi Scott,

thanks for checking again. Imo there is no real issue here. Maybe in the run where you have seen the issue the restart was "too" fast and the kernel side of the socket wasn't properly cleaned up or something similar.

I think we can use the ticket to implement Jakub's suggestion and add errno and the related error message to the log messages. But I'm not sure if this is really relevant for RHEL. So as an alternative we can just clone this ticket to upstream and close it here?

bye,
Sumit

Comment 9 Scott Poore 2019-04-15 16:55:15 UTC

Sounds like a plan to me.

Comment 10 Jakub Hrozek 2019-07-04 07:36:23 UTC

Since this can't even be tested, I'm just going to close the bug as UPSTREAM.

The commit that added the better logging is 75696ddc84ab08c8c885dacc7796ebc8afc429ec

Note You need to log in before you can comment on or make changes to this bug.