Bug 1733259 - Enabling SELinux has a significant impact on systemd-journald performance
Summary: Enabling SELinux has a significant impact on systemd-journald performance
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Ondrej Mosnacek
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-25 14:28 UTC by Michal Sekletar
Modified: 2020-02-18 12:55 UTC
CC List: 21 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-18 12:55:55 UTC
Type: Bug
Embargoed:


Attachments
journald flamegraph - SELinux enabled (191.84 KB, image/svg+xml), 2019-07-25 14:28 UTC, Michal Sekletar
journald flamegraph - SELinux disabled (212.43 KB, image/svg+xml), 2019-07-25 14:28 UTC, Michal Sekletar
journald flamegraph - SELinux enabled (rawhide kernel) (364.12 KB, image/svg+xml), 2019-07-26 08:26 UTC, Michal Sekletar
proposed patch (6.59 KB, patch), 2019-07-26 13:54 UTC, Ondrej Mosnacek
journald flamegraph - SELinux enabled with string cache (rawhide w/release config) (126.83 KB, image/svg+xml), 2019-10-25 15:15 UTC, Ondrej Mosnacek

Description Michal Sekletar 2019-07-25 14:28:14 UTC
Created attachment 1593423 [details]
journald flamegraph - SELinux enabled

1. Please describe the problem:

$subject

The number of log messages that journald was able to process in a given time window is roughly 28% smaller when SELinux is enabled.

2. What is the Version-Release number of the kernel:

kernel-5.0.17-200.fc29.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

This has probably always been a problem.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Yes

cat /dev/urandom | base64 | systemd-cat &
timeout 30s perf record -p $(pidof systemd-journald) -a -g

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

I don't know

6. Are you running any modules not shipped directly with Fedora's kernel?:
No

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
Kernel logs are useless here. Instead I am attaching flame graphs that I generated with and without SELinux enabled.

Comment 1 Michal Sekletar 2019-07-25 14:28:42 UTC
Created attachment 1593424 [details]
journald flamegraph - SELinux disabled

Comment 2 Laura Abbott 2019-07-25 14:32:14 UTC
kernel-5.0.17-200.fc29.x86_64 is 2 months old, can you reproduce this on a newer kernel?

Comment 3 Michal Sekletar 2019-07-26 08:25:44 UTC
(In reply to Laura Abbott from comment #2)
> kernel-5.0.17-200.fc29.x86_64 is 2 months old, can you reproduce this on a
> newer kernel?

I have now retested with kernel-5.3.0-0.rc1.git2.1.fc31.x86_64 and observed very similar behavior. Actually, the results are a tiny bit worse: 27M of logs in a 30s interval (SELinux enabled) vs. 39M of logs in 30s (disabled).

Flamegraph is attached.

Comment 4 Michal Sekletar 2019-07-26 08:26:21 UTC
Created attachment 1593632 [details]
journald flamegraph - SELinux enabled (rawhide kernel)

Comment 5 Ondrej Mosnacek 2019-07-26 10:00:11 UTC
So, dumping my thoughts after some initial analysis:

The function that causes the slowdown is security_secid_to_secctx() (its SELinux implementation). It is called from unix_dgram_recvmsg() via scm_recv(), which checks whether the SO_PASSSEC option is set on the socket; if it is, it converts the security label of the sending process to string form (using security_secid_to_secctx()) and passes it to userspace via an ancillary message (see unix(7)).

It doesn't look like the frequent conversion can be simply avoided, since we are dealing with datagram sockets... Most likely the problem is simply that the conversion itself is too slow (the mls_compute_context_len() and mls_sid_to_context() functions seem slower than they should be...).

I'll look into optimizing the mls_*() functions and if that fails then I'll see if there is some possibility to cache the stringified peer labels in AF_UNIX...
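
For illustration, here is a minimal userspace sketch of the SO_PASSSEC mechanism described above (error handling elided; the socket setup is assumed, and this is not taken from journald's actual source):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef SCM_SECURITY
#define SCM_SECURITY 0x03	/* value from <linux/socket.h> */
#endif

/* Receive one datagram on a bound AF_UNIX SOCK_DGRAM socket 'fd' and
 * print the sender's SELinux label delivered as ancillary data. */
static void recv_with_label(int fd)
{
	char data[4096], cbuf[CMSG_SPACE(256)];
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *c;
	int on = 1;

	/* SO_PASSSEC makes the kernel attach the sending process's
	 * security context to every datagram - this is what ends up
	 * calling security_secid_to_secctx() for each message. */
	setsockopt(fd, SOL_SOCKET, SO_PASSSEC, &on, sizeof(on));

	if (recvmsg(fd, &msg, 0) < 0)
		return;

	for (c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c))
		if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_SECURITY)
			printf("peer label: %.*s\n",
			       (int)(c->cmsg_len - CMSG_LEN(0)),
			       (char *)CMSG_DATA(c));
}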

Comment 6 Michal Sekletar 2019-07-26 12:11:15 UTC
(In reply to Ondrej Mosnacek from comment #5)

> I'll look into optimizing the mls_*() functions and if that fails then I'll
> see if there is some possibility to cache the stringified peer labels in
> AF_UNIX...

Hmm, but the label being computed each time is not a property of the socket but rather of the communicating peer, right? Wouldn't it then make sense to have the cache somewhere else (maybe in task_struct)?

Comment 7 Ondrej Mosnacek 2019-07-26 12:22:25 UTC
OK, it looks like the problem is that the mls_*() functions traverse the category bitmaps one bit at a time. So when a context contains a full category range (c0.c1023) or several long ranges, these functions have severely suboptimal performance. I have a tentative patch that should speed the conversion up considerably. On my testing VM it brings the runtime of the two mls_*() functions down from ~8000 ns to ~900 ns (on a context containing "c0.c1023"), so it looks promising. Let me polish it up and make a scratch build...
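
To illustrate the idea (a simplified sketch, not the actual patch; the helper name and exact output format are illustrative): instead of testing every category bit individually, walk from one run of set bits to the next, so a range like c0.c1023 is handled in one step instead of 1024.

/* Simplified sketch using the kernel's generic bitmap helpers:
 * compute the string length of a category set by walking runs of
 * set bits rather than individual bits. */
static size_t mls_cats_strlen(const unsigned long *map, unsigned int nbits)
{
	size_t len = 0;
	unsigned int start = find_first_bit(map, nbits);

	while (start < nbits) {
		/* skip to the end of this contiguous run of set bits */
		unsigned int end = find_next_zero_bit(map, nbits, start);

		if (end - start > 1)	/* whole range, e.g. "c0.c1023" */
			len += snprintf(NULL, 0, "c%u.c%u,", start, end - 1);
		else			/* single category, e.g. "c5" */
			len += snprintf(NULL, 0, "c%u,", start);

		start = find_next_bit(map, nbits, end);
	}
	return len;	/* "c0.c1023" now costs 2 bitmap ops, not 1024 */
}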

(In reply to Michal Sekletar from comment #6)
> (In reply to Ondrej Mosnacek from comment #5)
> 
> > I'll look into optimizing the mls_*() functions and if that fails then I'll
> > see if there is some possibility to cache the stringified peer labels in
> > AF_UNIX...
> 
> Hmm, but the label being computed each time is not a property of the socket
> but rather of the communicating peer, right? Wouldn't it then make sense to
> have the cache somewhere else (maybe in task_struct)?

Yes, the optimal place for the cache is open for discussion. Perhaps the right place would actually be SELinux itself - it could just transparently cache the sid -> context string conversions like it already caches AVC decisions. But hopefully the cache won't even be needed in the end (see above).

Comment 8 Paul Moore 2019-07-26 12:31:47 UTC
The conversion from the secid token to the secctx string format shouldn't happen on every SO_PASSSEC request; the conversion should only occur when the secid token is not present in the sidtab cache.

Comment 9 Ondrej Mosnacek 2019-07-26 12:41:41 UTC
(In reply to Paul Moore from comment #8)
> The conversion from the secid token to the secctx string format shouldn't
> happen on every SO_PASSSEC request; the conversion should only occur when
> the secid token is not present in the sidtab cache.

That's not true - the sidtab only stores the 'struct context' corresponding to the SID, which only contains the full string if the context is invalid under the current policy. For valid contexts the string is recomputed from scratch on each context_struct_to_string() call.
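
For reference, the entry stored in the sidtab looks roughly like this (abbreviated sketch of struct context from security/selinux/ss/context.h; exact layout varies across kernel versions):

struct context {
	u32 user;
	u32 role;
	u32 type;
	struct mls_range range;
	u32 len;	/* length of the string below, if present */
	char *str;	/* string form - only kept for contexts that are
			 * invalid under the currently loaded policy */
};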

Comment 10 Paul Moore 2019-07-26 13:33:31 UTC
My apologies, I remembered things incorrectly.

In this case, I would rather we looked at caching the string generation somehow (there are a few different ways we could do this) than add a bunch of tweaks to the MLS ebitmap code. In my opinion, if we are hitting the sid->secctx conversion often enough for it to be a problem, it is most likely always going to be a bottleneck, and implementing/improving a conversion cache will likely be a much better fix long term.

Comment 11 Ondrej Mosnacek 2019-07-26 13:52:01 UTC
Michal, I'm unable to reproduce the perf measurements on Fedora (29 & 30). It looks like stream sockets are being used instead of datagram sockets in systemd-journald - is there some additional configuration change needed to trigger the problem?

(In reply to Paul Moore from comment #10)
> In this case, I would rather we looked at caching the string generation
> somehow (there are a few different ways we could do this) than add a bunch
> of tweaks to the MLS ebitmap code. In my opinion, if we are hitting the
> sid->secctx conversion often enough for it to be a problem, it is most
> likely always going to be a bottleneck, and implementing/improving a
> conversion cache will likely be a much better fix long term.

Well, adding the cache is going to add some complexity and is likely going to take a lot more effort to implement. The ebitmap patch (I will attach it after this comment) adds only a little code to ebitmap.h and actually simplifies things nicely in the MLS functions. Whether or not we end up implementing the cache, it is IMHO worth applying on its own.

Comment 12 Ondrej Mosnacek 2019-07-26 13:54:41 UTC
Created attachment 1593709 [details]
proposed patch

Attaching the ebitmap patch. May contain bugs :)

Scratch build for F30 with the patch applied in progress here:
https://koji.fedoraproject.org/koji/taskinfo?taskID=36563787

Comment 13 Paul Moore 2019-07-26 16:12:02 UTC
We've gotten clever with the ebitmap code in the past and I've ended up regretting it.  I think there is benefit in keeping the ebitmap code as clean (tweak-free) as possible and instead implementing a proper cache to deal with the poor lookup performance.

Also, if you haven't learned this yet, let me make it clear that saying something is "going to take a lot more effort" has very little impact on my opinion :)

Comment 14 Michal Sekletar 2019-07-29 14:07:48 UTC
(In reply to Ondrej Mosnacek from comment #11)
> Michal, I'm unable to reproduce the perf measurements on Fedora (29 & 30).
> It looks like stream sockets are being used instead of datagram sockets in
> systemd-journald - is there some additional configuration change needed to
> trigger the problem?

Ahh, I made a copy-paste error in the reproducer. I was comparing SOCK_STREAM vs. SOCK_DGRAM performance and evidently copied the wrong command here. Please replace systemd-cat with "logger" and then you should be able to observe the problem.
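
That is, the corrected reproducer is:

cat /dev/urandom | base64 | logger &
timeout 30s perf record -p $(pidof systemd-journald) -a -g

(logger writes to /dev/log, which journald serves as a SOCK_DGRAM socket, whereas systemd-cat talks to journald's stream socket - hence the confusion.)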

Once again, I am very sorry for causing confusion.

Comment 15 Michal Sekletar 2019-07-29 14:16:22 UTC
Please correct me if I didn't get this right, but aren't the two proposed approaches actually orthogonal? IOW, shouldn't we speed up how we compute the context as well as cache it?

Comment 16 Paul Moore 2019-07-29 14:25:43 UTC
Caching the lookup will effectively eliminate the need to add tweaks to the MLS ebitmap code, and should provide greater performance as well (it caches the entire secid->secctx translation, not just the MLS category field).

As I mentioned above, there is value in keeping the MLS/ebitmap code simple(ish), and the cache would appear to be the more "proper" solution anyway.

Comment 17 Ondrej Mosnacek 2019-07-30 12:48:13 UTC
(In reply to Paul Moore from comment #16)
> Caching the lookup will effectively eliminate the need to add tweaks to the
> MLS ebitmap code, and should provide greater performance as well (it caches
> the entire secid->secctx translation, not just the MLS category field).
> 
> As I mentioned above, there is value in keeping the MLS/ebitmap code
> simple(ish), and the cache would appear to be the more "proper" solution
> anyway.

I wouldn't want to overcomplicate the ebitmap code either, but I don't think the proposed patch does that. It merely adds a new helper, sticking closely to the logic and style of the existing code. There may be value in avoiding intrusive code changes in ebitmap.[hc], but not in avoiding changes that actually improve things.

Regardless of whether we implement the cache (and I do agree we should), pointlessly calling find_next_bit() 1024 times for what is one of the most common category sets (c0.c1023) is just ridiculous.

Comment 18 Ondrej Mosnacek 2019-07-30 12:50:53 UTC
Comment on attachment 1593709 [details]
proposed patch

FWIW, I posted the patch properly to the list now:
https://patchwork.kernel.org/patch/11065721/

Comment 19 Stephen Smalley 2019-07-30 13:09:48 UTC
We didn't originally keep a copy of the string in the sidtab for two reasons:
1) Memory cost - context strings can potentially be quite large and there can be many of them, keeping them all around forever could be problematic
2) Aliasing - there can be multiple string representations of the same security context, due to user/role/type/sensitivity/category aliases, the compact category set notation (c0.c1023) versus enumerating the categories individually, differing category ordering, etc. Context strings have to be normalized (via security_context_to_sid() followed by security_sid_to_context(), or string_to_context_struct() followed by context_struct_to_string() internally) in order to be useful for comparisons.
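
For example, these strings (illustrative labels) all denote the same security context and will only compare equal after normalization:

system_u:object_r:etc_t:s0:c0.c3
system_u:object_r:etc_t:s0:c0,c1,c2,c3
system_u:object_r:etc_t:s0:c3,c2,c1,c0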

When the raw/unmapped context support was added, we stored those strings because they cannot be mapped to a context struct, since they aren't known to the loaded policy.
Presently the code presumes that if ->len or ->str is set then the context is unmapped, so if you start storing strings for all contexts, you'll need a separate field to indicate whether they are mapped.

Comment 20 Paul Moore 2019-07-30 14:33:11 UTC
Now that the patch has been posted upstream on the mailing list, I suggest we move the upstream related discussion there.

Comment 22 Justin M. Forbes 2019-08-20 17:41:33 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora 29 has now been rebased to 5.2.9-100.fc29. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 30, and are still experiencing this issue, please change the version to Fedora 30.

If you experience different issues, please open a new bug report for those.

Comment 23 Ondrej Mosnacek 2019-08-20 18:57:15 UTC
This issue is also present in the upstream kernel, so I am moving the bug to rawhide.

Comment 24 Ondrej Mosnacek 2019-10-25 15:08:50 UTC
I posted a patch [1] that implements the SID -> string cache as the alternative approach suggested above. This patch reduces the overhead of security_secid_to_secctx() from 65% down to only 2% in the systemd-journald scenario from the description (the cache hit ratio is almost 100% here, since the same context is used repeatedly).

A scratch build of a recent rawhide kernel with an only cosmetically different patch applied is available at [2]. Note that this build uses the release kernel configuration instead of the debug one, which is slower and is the default for rawhide kernels.

[1] https://lore.kernel.org/selinux/20191025145535.18474-1-omosnace@redhat.com/T/
[2] https://koji.fedoraproject.org/koji/taskinfo?taskID=38524856
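
For the record, the principle behind the cache in [1], as a heavily simplified sketch (single global entry; compute_context_str() is a hypothetical stand-in for the real slow-path conversion, and the actual patch caches per sidtab entry with finer-grained synchronization):

#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

int compute_context_str(u32 sid, char **str, u32 *len); /* hypothetical slow path */

static DEFINE_SPINLOCK(ctx_cache_lock);
static u32 cached_sid;
static u32 cached_len;
static char *cached_str;

/* Return a copy of the string form of 'sid', reusing the cached copy
 * when the same SID is converted repeatedly (as in the journald case). */
int sid_to_str_cached(u32 sid, char **str, u32 *len)
{
	int rc;

	spin_lock(&ctx_cache_lock);
	if (cached_str && sid == cached_sid) {
		/* fast path - nearly 100% hit rate in this scenario */
		*str = kmemdup(cached_str, cached_len, GFP_ATOMIC);
		*len = cached_len;
		spin_unlock(&ctx_cache_lock);
		return *str ? 0 : -ENOMEM;
	}
	spin_unlock(&ctx_cache_lock);

	rc = compute_context_str(sid, str, len);
	if (rc)
		return rc;

	spin_lock(&ctx_cache_lock);
	kfree(cached_str);
	cached_str = kmemdup(*str, *len, GFP_ATOMIC);
	cached_len = *len;
	cached_sid = sid;
	spin_unlock(&ctx_cache_lock);
	return 0;
}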

Comment 25 Ondrej Mosnacek 2019-10-25 15:15:51 UTC
Created attachment 1629248 [details]
journald flamegraph - SELinux enabled with string cache (rawhide w/release config)

Attaching a flamegraph for the scenario with scratch build 38524793 installed.

Comment 26 Ondrej Mosnacek 2019-12-10 08:31:42 UTC
A patch that resolves this issue is now staged in the selinux/next branch:

https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git/commit/?h=next&id=d97bd23c2d7d866e99eb3a927c742715c85a90ef

Unless something bad happens, it should become part of the 5.6 kernel release.

Comment 27 Ondrej Mosnacek 2020-02-18 12:55:55 UTC
This should now be resolved in rawhide.

