Bug 1650685 - sssd nss fatal error initializing socket
Summary: sssd nss fatal error initializing socket
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: sssd
Version: 8.0
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: 8.0
Assignee: Sumit Bose
QA Contact: sssd-qe
URL:
Whiteboard:
Depends On: 1682305
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-11-16 20:39 UTC by Scott Poore
Modified: 2019-07-04 07:36 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-04 07:36:23 UTC
Type: Bug
Target Upstream Version:


Attachments (Terms of Use)
sssd logs (448.21 KB, application/x-gzip)
2018-11-16 20:44 UTC, Scott Poore
no flags Details

Description Scott Poore 2018-11-16 20:39:35 UTC
Description of problem:

Seeing a failure when trying to restart sssd:

[root@rhel8-2 sssd]# systemctl start sssd
Job for sssd.service failed because the control process exited with error code.
See "systemctl status sssd.service" and "journalctl -xe" for details.


From journal:

Nov 16 14:18:38 rhel8-2.example.com sssd[nss][22931]: Starting up
Nov 16 14:18:38 rhel8-2.example.com sssd[22902]: Exiting the SSSD. Could not restart critical service >
Nov 16 14:18:38 rhel8-2.example.com sssd[pac][22910]: Shutting down
Nov 16 14:18:38 rhel8-2.example.com sssd[sudo][22909]: Shutting down
Nov 16 14:18:38 rhel8-2.example.com sssd[ssh][22908]: Shutting down
Nov 16 14:18:38 rhel8-2.example.com sssd[ifp][22907]: Shutting down
Nov 16 14:18:38 rhel8-2.example.com sssd[pam][22906]: Shutting down
Nov 16 14:18:38 rhel8-2.example.com sssd[be[example.com]][22904]: Shutting down
Nov 16 14:18:38 rhel8-2.example.com sssd[be[implicit_files]][22903]: Shutting down
Nov 16 14:18:38 rhel8-2.example.com systemd[1]: sssd.service: Main process exited, code=exited, status>
Nov 16 14:18:38 rhel8-2.example.com systemd[1]: sssd.service: Failed with result 'exit-code'.
Nov 16 14:18:38 rhel8-2.example.com systemd[1]: Failed to start System Security Services Daemon.

sssd_nss.log:

(Fri Nov 16 14:20:57 2018) [sssd[nss]] [create_pipe_fd] (0x0010): Unable to bind on socket '/var/lib/sss/pipes/nss'
(Fri Nov 16 14:20:57 2018) [sssd[nss]] [activate_unix_sockets] (0x0010): Fatal error initializing sockets
(Fri Nov 16 14:20:57 2018) [sssd[nss]] [sss_process_init] (0x0010): fatal error initializing socket
(Fri Nov 16 14:20:57 2018) [sssd[nss]] [sss_responder_ctx_destructor] (0x0400): Responder is being shut down
(Fri Nov 16 14:20:57 2018) [sssd[nss]] [nss_process_init] (0x0010): sss_process_init() failed


Version-Release number of selected component (if applicable):
sssd-2.0.0-23.el8.x86_64


How reproducible:
Unknown but has happened randomly a few times during testing

Steps to Reproduce:
1.  setup sssd via ipa-client
2.  systemctl stop sssd
3.  rm -rf /var/lib/sss/{db,mc}/*
4.  systemctl start sssd



Actual results:

failing as shown above.  


Expected results:

sssd starts cleanly.


Additional info:

Comment 1 Scott Poore 2018-11-16 20:44:37 UTC
Created attachment 1506550 [details]
sssd logs

Comment 2 Jakub Hrozek 2018-11-19 08:27:10 UTC
This is literally the bind(2) syscall failing, this debug message:
(Fri Nov 16 14:20:57 2018) [sssd[nss]] [create_pipe_fd] (0x0010): Unable to bind on socket '/var/lib/sss/pipes/nss'

corresponds to:
 848 
 849     if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
 850         DEBUG(SSSDBG_FATAL_FAILURE,
 851               "Unable to bind on socket '%s'\n", sock_name);
 852         ret = EIO;
 853         goto done;
 854     }

At the very least, we should print errno to know why bind failed.

btw Scott, did the system maybe use socket activation for the nss responder (systemctl status sssd-nss.socket) ?

Comment 3 Scott Poore 2018-11-19 19:21:50 UTC
I don't think so.  I just booted the VM back up and it's not running:

[root@rhel8-2 ~]# systemctl status sssd-nss.socket
● sssd-nss.socket - SSSD NSS Service responder socket
   Loaded: loaded (/usr/lib/systemd/system/sssd-nss.socket; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:sssd.conf(5)
   Listen: /var/lib/sss/pipes/nss (Stream)

I don't know if it's related but, I was seeing the failure several times so I added a 5 second sleep after cache cleanup rm and before sssd start.

Also, I manually ran the following one time:

sss_cache -E
sssctl cache-remove


After those steps, I reran my tests and did not see this issue again.

Comment 4 Jakub Hrozek 2018-11-20 11:32:49 UTC
(In reply to Scott Poore from comment #3)
> I don't think so.  I just booted the VM back up and it's not running:
> 

Is it a local VM or a VM in beaker? I'm asking if it was possible to poke around..

If not, is the reproducer as simple as 'restart sssd in a loop' ?

In the meantime, I opened https://github.com/SSSD/sssd/pull/697 which should at least help with debugging.

> [root@rhel8-2 ~]# systemctl status sssd-nss.socket
> ● sssd-nss.socket - SSSD NSS Service responder socket
>    Loaded: loaded (/usr/lib/systemd/system/sssd-nss.socket; disabled; vendor
> preset: disabled)
>    Active: inactive (dead)
>      Docs: man:sssd.conf(5)
>    Listen: /var/lib/sss/pipes/nss (Stream)
> 

OK, then the socket is disable and stopped, so it's not the socket activation causing issues.

> I don't know if it's related but, I was seeing the failure several times so
> I added a 5 second sleep after cache cleanup rm and before sssd start.
> 
> Also, I manually ran the following one time:
> 
> sss_cache -E
> sssctl cache-remove
> 
> 
> After those steps, I reran my tests and did not see this issue again.

Comment 5 Scott Poore 2018-11-20 13:51:55 UTC
Local VM.  I can set it up so you can login though if needed.   restarting sssd in a loop might reproduce the issue.  The test automation would have restarted it many times before I hit this issue.

Let me know if you want me to setup env for you to take a look

Comment 6 Sumit Bose 2019-04-12 09:12:11 UTC
Hi Scott.

can you still reproduce the issue? I tried to restart SSSD in a loop for a long time but wasn't able to see a startup failure.

bye,
Sumit

Comment 7 Scott Poore 2019-04-15 15:43:46 UTC
I have not seen the issue again.  Maybe corrupt cache somewhere else?   Or fluke or fixed as a side effect of another bugfix.

This is the version I am trying now.

sssd-2.0.0-43.el8.x86_64


So, if you don't have any ideas about what could have done it, I guess we can close this.   I am running a loop but, I'm not expecting it to find anything.  It's already run through 30+ restarts and no issues.

Comment 8 Sumit Bose 2019-04-15 16:21:37 UTC
(In reply to Scott Poore from comment #7)
> I have not seen the issue again.  Maybe corrupt cache somewhere else?   Or
> fluke or fixed as a side effect of another bugfix.
> 
> This is the version I am trying now.
> 
> sssd-2.0.0-43.el8.x86_64
> 
> 
> So, if you don't have any ideas about what could have done it, I guess we
> can close this.   I am running a loop but, I'm not expecting it to find
> anything.  It's already run through 30+ restarts and no issues.

Hi Scott,

thanks for checking again. Imo there is no real issue here. Maybe in the run where you have seen the issue the restart was "too" fast and the kernel side of the socket wasn't properly cleaned up or something similar.

I think we can use the ticket to implement Jakub's suggestion and add errno and the related error message to the log messages. But I'm not sure if this is really relevant for RHEL. So as an alternative we can just clone this ticket to upstream and close it here?

bye,
Sumit

Comment 9 Scott Poore 2019-04-15 16:55:15 UTC
Sounds like a plan to me.

Comment 10 Jakub Hrozek 2019-07-04 07:36:23 UTC
Since this can't even be tested, I'm just going to close the bug as UPSTREAM.

The commit that added the better logging is 75696ddc84ab08c8c885dacc7796ebc8afc429ec


Note You need to log in before you can comment on or make changes to this bug.