Bug 2088481 - selinux_child: Cannot beign SELinux transaction
Summary: selinux_child: Cannot beign SELinux transaction
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: sssd
Version: 34
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Alexey Tikhonov
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: sync-to-jira
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-05-19 14:24 UTC by Rob Crittenden
Modified: 2022-07-05 11:59 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-02 13:51:14 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker SSSD-4878 0 None None None 2022-07-01 20:24:03 UTC

Description Rob Crittenden 2022-05-19 14:24:32 UTC
Description of problem:

Some logins fail under load due to contention over the SELinux transaction lock in libsemanage.

I'm trying to gather scalability information on IPA. The test revolves around bringing up a number of client VMs, enrolling as IPA clients and running a forking client that does a PAM transaction with the login service. Each client has a set of unique users with non-expired passwords to authenticate with. No pre-loading of the cache is done.

In this particular case I brought up 10 clients and ran the test tool do to 10 logins. All of these 100 logins happen more or less simultaneously (time is synced). 30 of the authentications failed due to the transaction locking.

IPA should be able to handle 100 authentications without breaking a sweat and the server-side logs don't show any issues but it may be somewhat load related. If I re-run the test tool after the fact, even clearing the SSSD cache I can't reproduce the transaction failure.

I wonder if a retry could be implemented.

The selinux_child log contains the following. It is consistent across all the clients (log level 3):

(2022-05-19 13:42:01): [selinux_child[11668]] [libsemanage] (0x0020): Could not get direct transaction lock at /var/lib/selinux/targeted/semanage.trans.LOCK.
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2022-05-19 13:41:56): [selinux_child[11668]] [main] (0x0400): selinux_child started.
   *  (2022-05-19 13:41:56): [selinux_child[11668]] [main] (0x2000): Running with effective IDs: [0][0].
   *  (2022-05-19 13:41:56): [selinux_child[11668]] [main] (0x2000): Running with real IDs [0][0].
   *  (2022-05-19 13:41:56): [selinux_child[11668]] [main] (0x0400): context initialized
   *  (2022-05-19 13:41:56): [selinux_child[11668]] [unpack_buffer] (0x2000): seuser length: 12
   *  (2022-05-19 13:41:56): [selinux_child[11668]] [unpack_buffer] (0x2000): seuser: unconfined_u
   *  (2022-05-19 13:41:56): [selinux_child[11668]] [unpack_buffer] (0x2000): mls_range length: 14
   *  (2022-05-19 13:41:56): [selinux_child[11668]] [unpack_buffer] (0x2000): mls_range: s0-s0:c0.c1023
   *  (2022-05-19 13:41:56): [selinux_child[11668]] [unpack_buffer] (0x2000): username length: 23
   *  (2022-05-19 13:41:56): [selinux_child[11668]] [unpack_buffer] (0x2000): username: user6client000.ipa.test
   *  (2022-05-19 13:41:56): [selinux_child[11668]] [main] (0x0400): performing selinux operations
   *  (2022-05-19 13:41:56): [selinux_child[11668]] [seuser_needs_update] (0x2000): sss_get_seuser: ret: 0 seuser: unconfined_u mls: s0-s0:c0.c1023
   *  (2022-05-19 13:41:56): [selinux_child[11668]] [sss_seuser_exists] (0x0400): seuser exists: no
   *  (2022-05-19 13:41:56): [selinux_child[11668]] [seuser_needs_update] (0x0400): The SELinux user does need an update
   *  (2022-05-19 13:42:01): [selinux_child[11668]] [libsemanage] (0x0020): Could not get direct transaction lock at /var/lib/selinux/targeted/semanage.trans.LOCK.
********************** BACKTRACE DUMP ENDS HERE *********************************

(2022-05-19 13:42:01): [selinux_child[11668]] [sss_set_seuser] (0x0020): Cannot begin SELinux transaction
(2022-05-19 13:42:01): [selinux_child[11668]] [main] (0x0020): Cannot set SELinux login context.
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2022-05-19 13:42:01): [selinux_child[11668]] [sss_set_seuser] (0x0020): Cannot begin SELinux transaction
   *  (2022-05-19 13:42:01): [selinux_child[11668]] [main] (0x0020): Cannot set SELinux login context.
********************** BACKTRACE DUMP ENDS HERE *********************************

(2022-05-19 13:42:01): [selinux_child[11668]] [main] (0x0020): selinux_child failed!
(2022-05-19 13:42:01): [selinux_child[11671]] [libsemanage] (0x0020): Could not get direct transaction lock at /var/lib/selinux/targeted/semanage.trans.LOCK.
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2022-05-19 13:41:56): [selinux_child[11671]] [main] (0x0400): selinux_child started.
   *  (2022-05-19 13:41:56): [selinux_child[11671]] [main] (0x2000): Running with effective IDs: [0][0].
   *  (2022-05-19 13:41:56): [selinux_child[11671]] [main] (0x2000): Running with real IDs [0][0].
   *  (2022-05-19 13:41:56): [selinux_child[11671]] [main] (0x0400): context initialized
   *  (2022-05-19 13:41:56): [selinux_child[11671]] [unpack_buffer] (0x2000): seuser length: 12
   *  (2022-05-19 13:41:56): [selinux_child[11671]] [unpack_buffer] (0x2000): seuser: unconfined_u
   *  (2022-05-19 13:41:56): [selinux_child[11671]] [unpack_buffer] (0x2000): mls_range length: 14
   *  (2022-05-19 13:41:56): [selinux_child[11671]] [unpack_buffer] (0x2000): mls_range: s0-s0:c0.c1023
   *  (2022-05-19 13:41:56): [selinux_child[11671]] [unpack_buffer] (0x2000): username length: 23
   *  (2022-05-19 13:41:56): [selinux_child[11671]] [unpack_buffer] (0x2000): username: user1client000.ipa.test
   *  (2022-05-19 13:41:56): [selinux_child[11671]] [main] (0x0400): performing selinux operations
   *  (2022-05-19 13:41:56): [selinux_child[11671]] [seuser_needs_update] (0x2000): sss_get_seuser: ret: 0 seuser: unconfined_u mls: s0-s0:c0.c1023
   *  (2022-05-19 13:41:56): [selinux_child[11671]] [sss_seuser_exists] (0x0400): seuser exists: no
   *  (2022-05-19 13:41:56): [selinux_child[11671]] [seuser_needs_update] (0x0400): The SELinux user does need an update
   *  (2022-05-19 13:42:01): [selinux_child[11671]] [libsemanage] (0x0020): Could not get direct transaction lock at /var/lib/selinux/targeted/semanage.trans.LOCK.
********************** BACKTRACE DUMP ENDS HERE *********************************

(2022-05-19 13:42:01): [selinux_child[11671]] [sss_set_seuser] (0x0020): Cannot begin SELinux transaction
(2022-05-19 13:42:01): [selinux_child[11671]] [main] (0x0020): Cannot set SELinux login context.
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2022-05-19 13:42:01): [selinux_child[11671]] [sss_set_seuser] (0x0020): Cannot begin SELinux transaction
   *  (2022-05-19 13:42:01): [selinux_child[11671]] [main] (0x0020): Cannot set SELinux login context.
********************** BACKTRACE DUMP ENDS HERE *********************************

(2022-05-19 13:42:01): [selinux_child[11671]] [main] (0x0020): selinux_child failed!

Version-Release number of selected component (if applicable):

sssd-common-2.5.2-2.fc34

Comment 1 Alexey Tikhonov 2022-05-24 18:14:13 UTC
Failing function is `semanage_begin_transaction()`:
```
/* Attempt to obtain a transaction lock on the manager.  If another
 * process has the lock then this function may block, depending upon
 * the timeout value in the handle.
 *
 * Note that if the semanage_handle has not yet obtained a transaction
 * lock whenever a writer function is called, there will be an
 * implicit call to this function. */
extern int semanage_begin_transaction(semanage_handle_t *);
```

From a quick glance I don't find an API to specify this timeout explicitly.
But https://github.com/SELinuxProject/selinux/blob/0a8c177dacdc1df96ea11bb8aa75e16c4fa82285/libsemanage/src/handle.c#L101:
```
sh->timeout = SEMANAGE_COMMIT_READ_WAIT;
```
and https://github.com/SELinuxProject/selinux/blob/0a8c177dacdc1df96ea11bb8aa75e16c4fa82285/libsemanage/src/handle.c#L40:
```
https://github.com/SELinuxProject/selinux/blob/0a8c177dacdc1df96ea11bb8aa75e16c4fa82285/libsemanage/src/handle.c#L40
```

This matches timestamps in the log:
```
   *  (2022-05-19 13:41:56): [selinux_child[11668]] [seuser_needs_update] (0x0400): The SELinux user does need an update
   *  (2022-05-19 13:42:01): [selinux_child[11668]] [libsemanage] (0x0020): Could not get direct transaction lock at /var/lib/selinux/targeted/semanage.trans.LOCK.
```

I'm not sure a retry would help. If a retry would succeed immediately this would mean a bug in `libsemanage`, imo.

Btw, Rob, if your goal is a load testing of a *server* then you could consider setting `selinux_provider = none` on the client.
It will reduce amount of fetched data, but I think overall request rate will be increased.

Comment 2 Rob Crittenden 2022-05-24 18:21:45 UTC
Thanks, Sumit suggested the same. I'm trying at first with vanilla client and server installs and I'll go from there. Modifying this on all clients is trivial with Ansible or other config management tools so users that don't do SELinux assignments would be fine with that, but I'm focusing on the defaults right now.

Comment 3 Andre Boscatto 2022-06-02 13:51:14 UTC
Hi Rob,

We discussed this BZ during our weekly call a couple of minutes ago and Sumit gave a perfect background overview about what is happening.

It seems to be a collateral effect of the Load/Performance Test setup, which is something expected from this kind of situation.

With that being said, I will go ahead and close it as NOTABUG, please re-open if you disagree or if there is any flaw we might find in a real case scenario that should be indeed addressed.

Kindly

Comment 4 Alexey Tikhonov 2022-07-01 14:21:14 UTC
JFTR: I disagree with the resolution.

- limits aren't documented
- there is no clear message neither to the user nor to the admin explaining what has happened
- this creates artificial bottleneck and degrades performance limits overall for no good reason

There is clearly a room for improvement. Perhaps scenario is not important enough to justify such improvement, so resolution could be "WONTFIX". But "NOTBUG" sounds like denial of any issue, and I think this is wrong.

Comment 5 Andre Boscatto 2022-07-01 20:21:20 UTC
As discussed with Alexey in private, I am changing the resolution from NOTABUG to WONTFIX. 

Plans are to, at least, take a look into this during CY22Q4 or later (CY23Q1), to validate what can be done to improve the user experience (the load is low, and we have similar cases already opened in the past).

Rob, if you have new information by then, please let us know, so we can take it into account the new results from the benchmark you are working on.

Last but not least, thanks, Alexey, for pointing it out and explaining it differently and bringing another point of view, I do really appreciate it.

Kindly

Comment 6 Rob Crittenden 2022-07-05 11:59:56 UTC
It's up to you I guess but if enough people, and it doesn't require many, log into a sssd-managed system simultaneously it's very possible for one or more to fail because of contention on the SELinux lock.


Note You need to log in before you can comment on or make changes to this bug.