Bug 1654537 - SSSD 2.0 has drastically lower sbus timeout than 1.x, this can result in time outs
Summary: SSSD 2.0 has drastically lower sbus timeout than 1.x, this can result in time outs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: sssd
Version: 29
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Jakub Hrozek
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1661183
 
Reported: 2018-11-29 02:58 UTC by Adam Williamson
Modified: 2018-12-20 10:29 UTC
CC List: 9 users

Fixed In Version: sssd-2.0.0-5.fc29
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned As: 1661183
Environment:
Last Closed: 2018-12-17 19:12:13 UTC
Type: Bug
Embargoed:


Attachments
/var/log tarball from client (1.94 MB, application/octet-stream)
2018-11-29 02:59 UTC, Adam Williamson
/var/log tarball from server (3.61 MB, application/octet-stream)
2018-11-29 03:00 UTC, Adam Williamson
client logs with sssd loglevel 9 (1.40 MB, application/octet-stream)
2018-11-30 17:31 UTC, Adam Williamson

Description Adam Williamson 2018-11-29 02:58:53 UTC
Sorry this is a bit vague for now, but I wanted to get it filed so other folks can take a look at it.

For the last 10 days or so, the openQA FreeIPA tests have periodically been running into a failure when logging into a console on a domain member system as a domain user (who has the right to log into that system). What happens is that the test enters the username and password, the system grinds away for a bit, then the message "System error" appears on the screen for a few seconds and it cycles back to the login prompt.

The logs don't provide a whole lot more detail; here's what I can see:

Nov 21 13:40:23 client002.domain.local audit[2993]: USER_ACCT pid=2993 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 msg='op=PAM:accounting grantors=? acct="test3" exe="/usr/bin/login" hostname=client002.domain.local addr=? terminal=tty4 res=failed'
Nov 21 13:40:23 client002.domain.local login[2993]: pam_sss(login:account): Access denied for user test3: 4 (System error)
Nov 21 13:40:23 client002.domain.local login[2993]: System error

That's from the client journal. The server journal has nothing at all around the same time. There are entries in the dirsrv access log, but I'm not sure whether any of them are of interest.

This problem is intermittent; it happens on something like 1 in 3 or 1 in 4 tests. Often it doesn't happen and the tests pass just fine. Because it's intermittent, I can't pin down the initial cause, unfortunately: the *first* update for which I can see a failure like this was an update to xen, so it's pretty clear that isn't the cause.

I'm attaching tarballs of /var/log from both the client and the server for the earliest occurrence of this that I can find ATM.

Client test: https://openqa.fedoraproject.org/tests/310841
Server test: https://openqa.fedoraproject.org/tests/310838

There are two failed attempts to log in. The useful one was as 'test3' which occurs at 13:39:58 UTC with the "System error" being logged at 13:40:23 UTC. There is a later attempt to log in as 'test1', but that one's tricky to analyze as the timestamps are off - after the first failure, openQA loads a snapshot of a 'known good' state, which messes up the comparison between timestamps in the openQA log and the system logs. I'll attach the log tarballs just for the first failure, so things are clear.

Comment 1 Adam Williamson 2018-11-29 02:59:32 UTC
Created attachment 1509726 [details]
/var/log tarball from client

Comment 2 Adam Williamson 2018-11-29 03:00:39 UTC
Created attachment 1509727 [details]
/var/log tarball from server

Comment 3 Sumit Bose 2018-11-29 08:41:25 UTC
Hi Adam,

do you have a chance to re-run the tests with 'debug_level=9' at least in the [pam] and [domain/...] sections of sssd.conf?
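
Something like this, assuming the domain section is named after the domain.local realm seen in the logs:

    [pam]
    debug_level = 9

    [domain/domain.local]
    debug_level = 9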

'System error' is typically used by SSSD to handle some internal error.

bye,
Sumit

Comment 4 Adam Williamson 2018-11-29 16:40:03 UTC
Thanks Sumit. I'll try tweaking the staging instance to do that. Then we'll have to wait until it happens to bite another update...or I can just re-run the test manually over and over till it happens, I guess.

Comment 5 Adam Williamson 2018-11-30 17:31:31 UTC
Created attachment 1510214 [details]
client logs with sssd loglevel 9

OK, here's a log tarball of a case where it failed with the debug log level set to 9. This is from:

https://openqa.stg.fedoraproject.org/tests/415166

The failed login attempt as test1 happens somewhere around 15:47:52 (UTC) and the 'System error' occurs at 15:48:17 (UTC). That's 10:47:52 and 10:48:17 in the sssd log files.

Hope this helps.

Comment 6 Adam Williamson 2018-11-30 17:33:52 UTC
Hum, this in sssd_pam.log looks interesting:

(Fri Nov 30 10:47:52 2018) [sssd[pam]] [pam_dp_send_req] (0x0100): Sending request with the following data:
(Fri Nov 30 10:47:52 2018) [sssd[pam]] [pam_print_data] (0x0100): command: SSS_PAM_ACCT_MGMT
(Fri Nov 30 10:47:52 2018) [sssd[pam]] [pam_print_data] (0x0100): domain: domain.local
(Fri Nov 30 10:47:52 2018) [sssd[pam]] [pam_print_data] (0x0100): user: test1
(Fri Nov 30 10:47:52 2018) [sssd[pam]] [pam_print_data] (0x0100): service: login
(Fri Nov 30 10:47:52 2018) [sssd[pam]] [pam_print_data] (0x0100): tty: tty2
(Fri Nov 30 10:47:52 2018) [sssd[pam]] [pam_print_data] (0x0100): ruser: not set
(Fri Nov 30 10:47:52 2018) [sssd[pam]] [pam_print_data] (0x0100): rhost: not set
(Fri Nov 30 10:47:52 2018) [sssd[pam]] [pam_print_data] (0x0100): authtok type: 0
(Fri Nov 30 10:47:52 2018) [sssd[pam]] [pam_print_data] (0x0100): newauthtok type: 0
(Fri Nov 30 10:47:52 2018) [sssd[pam]] [pam_print_data] (0x0100): priv: 1
(Fri Nov 30 10:47:52 2018) [sssd[pam]] [pam_print_data] (0x0100): cli_pid: 2308
(Fri Nov 30 10:47:52 2018) [sssd[pam]] [pam_print_data] (0x0100): logon name: test1
(Fri Nov 30 10:47:52 2018) [sssd[pam]] [pam_print_data] (0x0100): flags: 0
(Fri Nov 30 10:47:52 2018) [sssd[pam]] [pam_dom_forwarder] (0x0100): pam_dp_send_req returned 0
(Fri Nov 30 10:47:56 2018) [sssd[pam]] [pam_initgr_cache_remove] (0x2000): [test1] removed from PAM initgroup cache
(Fri Nov 30 10:47:57 2018) [sssd[pam]] [pam_initgr_cache_remove] (0x2000): [test1] removed from PAM initgroup cache
(Fri Nov 30 10:47:57 2018) [sssd[pam]] [pam_initgr_cache_remove] (0x2000): [test1] removed from PAM initgroup cache
(Fri Nov 30 10:48:17 2018) [sssd[pam]] [sbus_dispatch] (0x4000): Dispatching.
(Fri Nov 30 10:48:17 2018) [sssd[pam]] [sbus_reply_check] (0x4000): D-Bus error [org.freedesktop.DBus.Error.NoReply]: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
(Fri Nov 30 10:48:17 2018) [sssd[pam]] [pam_dp_send_req_done] (0x0020): PAM handler failed [1432158312]: No reply was received

Comment 7 Adam Williamson 2018-12-04 20:01:27 UTC
ping?

Comment 8 Sumit Bose 2018-12-05 11:11:07 UTC
Hi Adam,

sorry for the delay. 

I think the reason is the same as in https://bugzilla.redhat.com/show_bug.cgi?id=1644919: the selinux_child needs lots of time to finish.

To me it looks like the 'System error' is returned if selinux_child needs 15s or more and the request on the internal D-Bus connection times out, as you have shown in the log snippet above.

If I understand it correctly, we let libdbus decide the timeout because we call e.g. dbus_connection_send_with_reply() with a timeout value of '-1'. Pavel, is this correct, and is there a way to tune this timeout?
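
For illustration, roughly what this looks like at the libdbus level (a sketch, not SSSD's sbus wrapper code; the helper name is made up):

    #include <dbus/dbus.h>

    /* Sketch, not SSSD code: with timeout_milliseconds = -1, libdbus
     * falls back to its built-in default instead of a caller-chosen
     * value. */
    static dbus_bool_t send_with_default_timeout(DBusConnection *conn,
                                                 DBusMessage *msg,
                                                 DBusPendingCall **pending)
    {
        return dbus_connection_send_with_reply(conn, msg, pending, -1);
    }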

Currently I cannot say what causes the delay in selinux_child. At about the time the selinux_child finishes there are the following messages in /var/log/messages:

Nov 30 10:48:15 client003 kernel: SELinux:  Class xdp_socket not defined in policy.
Nov 30 10:48:15 client003 kernel: SELinux: the above unknown classes and permissions will be allowed
Nov 30 10:48:16 client003 kernel: SELinux:  Context unconfined_u:unconfined_r:sandbox_t:s0-s0:c0.c1023 became invalid (unmapped).
Nov 30 10:48:16 client003 kernel: SELinux:  Context system_u:unconfined_r:sandbox_t:s0-s0:c0.c1023 became invalid (unmapped).
Nov 30 10:48:17 client003 kernel: SELinux:  policy capability network_peer_controls=1
Nov 30 10:48:17 client003 kernel: SELinux:  policy capability open_perms=1
Nov 30 10:48:17 client003 kernel: SELinux:  policy capability extended_socket_class=1
Nov 30 10:48:17 client003 kernel: SELinux:  policy capability always_check_network=0
Nov 30 10:48:17 client003 kernel: SELinux:  policy capability cgroup_seclabel=1
Nov 30 10:48:17 client003 kernel: SELinux:  policy capability nnp_nosuid_transition=1

I'm not sure whether the 'became invalid' messages are of importance here?

bye,
Sumit

Comment 9 Jakub Hrozek 2018-12-05 11:39:34 UTC
(In reply to Sumit Bose from comment #8)
> Currently I cannot say what causes the delay in selinux_child. At about the
> time the selinux_child finishes there are the following messages in
> /var/log/messages:

I think the reason might be a fix for https://pagure.io/SSSD/sssd/issue/3819, see commit https://pagure.io/SSSD/sssd/c/945865a

tl;dr adding the user-role mapping to semanage is an expensive operation. Previously we only added the mapping if the user's SELinux label was different from the default (typically, different from unconfined), but this was breaking the case where the home directory is non-standard, because in that case you need the explicit mapping so that libselinux also knows that the files under this user's home directory should be labeled as home_dir_t.

It would be trivial to revert that fix and provide a test build to prove this theory.

This would also correlate with the other bug that Adam filed where the login suddenly takes longer. This is an issue I was sort of afraid of when I implemented the patch, but at the same time I didn't see any other way around the bug. Since explicitly adding the user to the selinux database is needed to create the mappings for the user's home directory, maybe we could also call some libsemanage/libselinux function to see if the home directory already has the home_t mapping and avoid the unconditional add.
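
A rough sketch of that idea, assuming selabel_lookup() from libselinux is a suitable way to ask what label a path would get from the current file-context database (the API calls are real; the check itself and the helper name are hypothetical):

    #include <string.h>
    #include <sys/stat.h>
    #include <selinux/selinux.h>
    #include <selinux/label.h>

    /* Hypothetical pre-check: ask the file-context database which label
     * the user's home directory would receive; if it already resolves
     * to a home label, the expensive semanage add could be skipped. */
    static int home_dir_already_mapped(const char *homedir)
    {
        struct selabel_handle *hnd;
        char *con = NULL;
        int mapped = 0;

        hnd = selabel_open(SELABEL_CTX_FILE, NULL, 0);
        if (hnd == NULL) {
            return 0;
        }

        if (selabel_lookup(hnd, &con, homedir, S_IFDIR) == 0) {
            /* e.g. "unconfined_u:object_r:user_home_dir_t:s0"; the
             * string match here is illustrative only. */
            mapped = (strstr(con, "home_dir_t") != NULL);
            freecon(con);
        }

        selabel_close(hnd);
        return mapped;
    }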

Comment 10 Pavel Březina 2018-12-05 11:55:15 UTC
If there is a real need for a bigger timeout, we can set a custom value instead of -1.

Comment 11 Sumit Bose 2018-12-05 12:12:23 UTC
(In reply to Pavel Březina from comment #10)
> If there is a real need for bigger timeout we can set a custom value instead
> of -1.

I was mainly guessing that the D-Bus error was caused by a timeout; the message is a bit vague here. Do you know what libdbus would pick as the default timeout?

Comment 12 Pavel Březina 2018-12-05 15:06:11 UTC
There is this 25 second delay before the reply to the dbus method is sent, so it is a timeout issue, although we almost made it: the default dbus timeout is 25 seconds (_DBUS_DEFAULT_TIMEOUT_VALUE in the dbus code). So it was actually a race over what finishes sooner.

(Fri Nov 30 10:47:52 2018) [sssd[be[domain.local]]] [write_pipe_handler] (0x0400): All data has been sent!
(Fri Nov 30 10:48:17 2018) [sssd[be[domain.local]]] [child_sig_handler] (0x1000): Waiting for child [2316].
(Fri Nov 30 10:48:17 2018) [sssd[be[domain.local]]] [child_sig_handler] (0x0100): child [2316] finished successfully.
(Fri Nov 30 10:48:17 2018) [sssd[be[domain.local]]] [read_pipe_handler] (0x0400): EOF received, client finished
(Fri Nov 30 10:48:17 2018) [sssd[be[domain.local]]] [dp_req_done] (0x0400): DP Request [PAM SELinux #6]: Request handler finished [0]: Success
(Fri Nov 30 10:48:17 2018) [sssd[be[domain.local]]] [_dp_req_recv] (0x0400): DP Request [PAM SELinux #6]: Receiving request data.
(Fri Nov 30 10:48:17 2018) [sssd[be[domain.local]]] [dp_req_destructor] (0x0400): DP Request [PAM SELinux #6]: Request removed.
(Fri Nov 30 10:48:17 2018) [sssd[be[domain.local]]] [dp_req_destructor] (0x0400): Number of active DP request: 0
(Fri Nov 30 10:48:17 2018) [sssd[be[domain.local]]] [sbus_issue_request_done] (0x0400): sssd.dataprovider.pamHandler: Success

From selinux child:
(Fri Nov 30 10:47:52 2018) [[sssd[selinux_child[2316]]]] [seuser_needs_update] (0x0400): The SELinux user does need an update
(Fri Nov 30 10:48:17 2018) [[sssd[selinux_child[2316]]]] [pack_buffer] (0x0400): result [0]

I think that sc_set_seuser just takes too long, and if possible we should make it faster or increase the timeout:
    needs_update = seuser_needs_update(username, ibuf->seuser,
                                       ibuf->mls_range);
    if (needs_update == true) {
        ret = sc_set_seuser(username, ibuf->seuser, ibuf->mls_range);
        if (ret != EOK) {
            DEBUG(SSSDBG_CRIT_FAILURE, "Cannot set SELinux login context.\n");
            goto fail;
        }
    }
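
A quick way to confirm where the time goes would be to time that call directly. A sketch meant to be dropped into the snippet above (not code from the tree; DEBUG and SSSDBG_TRACE_FUNC are SSSD's existing logging macros):

    #include <time.h>

    /* Sketch: wrap the sc_set_seuser() call shown above with
     * CLOCK_MONOTONIC and log the elapsed wall-clock seconds. */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    ret = sc_set_seuser(username, ibuf->seuser, ibuf->mls_range);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    DEBUG(SSSDBG_TRACE_FUNC, "sc_set_seuser() took %.1f s\n",
          (double)(t1.tv_sec - t0.tv_sec)
          + (t1.tv_nsec - t0.tv_nsec) / 1e9);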

Comment 13 Adam Williamson 2018-12-05 16:42:25 UTC
Aha, thanks for the info, guys. Maybe we should do two builds: one with a *shorter* timeout and one with a *longer* timeout. Then I can verify that the bug happens more often (or all the time) with the short timeout, and run the build with a longer timeout through a few iterations and make sure the bug doesn't happen. That should confirm the theory?

Comment 14 Adam Williamson 2018-12-07 03:08:45 UTC
So, I spent an hour trying to pick through the thicket of bits in sssd to work out where to change the timeout on that side, then realized it'd be much easier to just change it in dbus :P

So I did a couple of scratch dbus builds, one with the timeout dropped to 15 secs, one with it bumped to 60 secs. I ran the tests twice with the reduced timeout and they failed both times; I ran twice with the bumped timeout and they passed both times. I can run a few more times to confirm the impression, but it does look a lot like the timeout is indeed the issue here.
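
For the record, the knob I changed was libdbus's internal default; as best I recall it lives in dbus/dbus-connection-internal.h (treat the exact header location as an assumption), and the two scratch builds were roughly:

    /* upstream default: #define _DBUS_DEFAULT_TIMEOUT_VALUE (25 * 1000) */
    #define _DBUS_DEFAULT_TIMEOUT_VALUE (15 * 1000)   /* "short" build */
    /* ...and in the other build: (60 * 1000), the "long" variant */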

Comment 15 Jakub Hrozek 2018-12-07 08:06:50 UTC
(In reply to Adam Williamson from comment #14)
> So, I spent an hour trying to pick through the thicket of bits in sssd to
> work out where to change the timeout that side, then realized it'd be much
> easier to just change it in dbus :P
> 
> So I did a couple of scratch dbus builds, one with the timeout dropped to 15
> secs, one with it bumped to 60 secs. I ran the tests twice with the reduced
> timeout and they failed both times; I ran twice with the bumped timeout and
> they passed both times. I can run a few more times to confirm the
> impression, but it does look a lot like the timeout is indeed the issue here.

Great, thank you very much for testing.

Pavel, can you point us to the place where we can set the timeout (or send a PR before you leave for the break)?

Comment 16 Pavel Březina 2018-12-07 10:00:30 UTC
$ git grep SBUS_MESSAGE_TIMEOUT
src/sbus/request/sbus_request.c:    subreq = sbus_message_send(state, conn, msg, SBUS_MESSAGE_TIMEOUT);
src/sbus/sbus_message.h:#define SBUS_MESSAGE_TIMEOUT -1
src/sbus/sbus_sync.h: * @see SBUS_MESSAGE_TIMEOUT
src/sbus/sync/sbus_sync.c:    ret = sbus_sync_message_send(NULL, conn, msg, SBUS_MESSAGE_TIMEOUT, NULL);
src/sbus/sync/sbus_sync_call.c:    ret = sbus_sync_message_send(tmp_ctx, conn, msg, SBUS_MESSAGE_TIMEOUT,

It is sufficient to change the definition of SBUS_MESSAGE_TIMEOUT in src/sbus/sbus_message.h. Prior to 2.0, the timeout was SSS_CLI_SOCKET_TIMEOUT/2 (= 150 seconds), so we can set it to this value as well, although it is ridiculously long. This can be used as a hotfix, but we should check with the SELinux folks whether there is any way to speed the operation up.
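
In other words, roughly this (libdbus interprets the value as milliseconds; -1 means "use the libdbus default" of ~25s):

    /* src/sbus/sbus_message.h -- sketch of the hotfix; 150000 ms
     * restores the pre-2.0 value of SSS_CLI_SOCKET_TIMEOUT/2 = 150 s. */
    #define SBUS_MESSAGE_TIMEOUT 150000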

Comment 17 Adam Williamson 2018-12-13 06:37:42 UTC
I sent a PR that extends SBUS_MESSAGE_TIMEOUT, for now anyway:

https://github.com/SSSD/sssd/pull/715

CCing mgrepl for the SELinux question - mgrepl, see comments #8 and #9 for more detail on the expensive SELinux operation that's going on here; we want to know if it can be made...less expensive. :)

Comment 18 Fedora Update System 2018-12-13 07:58:30 UTC
sssd-2.0.0-5.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2018-8f0282af53

Comment 19 Jakub Hrozek 2018-12-13 13:30:13 UTC
(In reply to Adam Williamson from comment #17)
> I sent a PR that extends SBUS_MESSAGE_TIMEOUT, for now anyway:
> 
> https://github.com/SSSD/sssd/pull/715
> 
> CCing mgrepl for the SELinux question - mgrepl, see comments #8 and #9 for
> more detail on the expensive SELinux operation that's going on here, we want
> to know if it can be made...less expensive. :)

btw when I was looking at the original bug I was in contact with Petr Lautrbach, not with mgrepl (just saying so that mgrepl knows who to redirect the needinfo to...)

Comment 20 Adam Williamson 2018-12-13 14:55:22 UTC
aha, OK. mgrepl is just my selinux goto. :)

Comment 21 Fedora Update System 2018-12-17 19:12:13 UTC
sssd-2.0.0-5.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.

