Created attachment 1593423 [details]
journald flamegraph - SELinux enabled
1. Please describe the problem:
Number of logs that journal was able to process in a certain time window is roughly 28% smaller when SELinux is enabled.
2. What is the Version-Release number of the kernel:
3. Did it work previously in Fedora? If so, what kernel version did the issue
*first* appear? Old kernels are available for download at
This was probably always the case.
4. Can you reproduce this issue? If so, please provide the steps to reproduce
the issue below:
cat /dev/urandom | base64 | systemd-cat &
timeout 30s perf record -p $(pidof systemd-journald) -a -g
5. Does this problem occur with the latest Rawhide kernel? To install the
Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
``sudo dnf update --enablerepo=rawhide kernel``:
I don't know
6. Are you running any modules not shipped directly with Fedora's kernel?:
7. Please attach the kernel logs. You can get the complete kernel log
for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
issue occurred on a previous boot, use the journalctl ``-b`` flag.
The kernel logs are not useful here. Instead, I am attaching flame graphs that I generated with and without SELinux enabled.
Created attachment 1593424 [details]
journald flamegraph - SELinux disabled
kernel-5.0.17-200.fc29.x86_64 is 2 months old, can you reproduce this on a newer kernel
(In reply to Laura Abbott from comment #2)
> kernel-5.0.17-200.fc29.x86_64 is 2 months old, can you reproduce this on a
> newer kernel
I now retested with kernel-5.3.0-0.rc1.git2.1.fc31.x86_64 and observed very similar behavior. Actually, the results are a tiny bit worse: 27M of logs processed in a 30s interval with SELinux enabled vs. 39M in 30s without.
Flamegraph is attached.
Created attachment 1593632 [details]
journald flamegraph - SELinux enabled (rawhide kernel)
So, dumping my thoughts after some initial analysis:
The function that causes the slowdown is security_secid_to_secctx() (its SELinux implementation). It is called from unix_dgram_recvmsg() via scm_recv(), which checks whether the SO_PASSSEC option is set on the socket; if it is, the security label of the sending process is converted to string form (using security_secid_to_secctx()) and passed to userspace via an ancillary message (see unix(7)).
It doesn't look like the frequent conversion can be simply avoided, since we are dealing with datagram sockets... Most likely the problem is simply that the conversion itself is too slow (it looks like the mls_compute_context_len() and mls_sid_to_context() functions may be slower than they should be...).
I'll look into optimizing the mls_*() functions and if that fails then I'll see if there is some possibility to cache the stringified peer labels in AF_UNIX...
(In reply to Ondrej Mosnacek from comment #5)
> I'll look into optimizing the mls_*() functions and if that fails then I'll
> see if there is some possibility to cache the stringified peer labels in
Hmm, but the label that is being computed each time is not a property of the socket but rather of the communicating peer, right? Wouldn't it make sense then to have the cache for it somewhere else (maybe task_struct)?
OK, it looks like the problem is that the mls_*() functions traverse the category bitmaps one bit at a time. So when a context contains a full category range (c0.c1023) or several long ranges, these functions have severely suboptimal performance. I have a tentative patch that should make the conversion more efficient. On my testing VM it brings the runtime of the two mls_*() functions down from ~8000 ns to ~900 ns (on a context containing "c0.c1023"), so it looks promising. Let me polish it up and make a scratch build...
(In reply to Michal Sekletar from comment #6)
> (In reply to Ondrej Mosnacek from comment #5)
> > I'll look into optimizing the mls_*() functions and if that fails then I'll
> > see if there is some possibility to cache the stringified peer labels in
> > AF_UNIX...
> Hmm, but the label that is being computed each time is not the property of
> the socket rather a property of communicating peer, right? Wouldn't make
> sense then to have the cache for it somewhere else (maybe task_struct)?
Yes, the optimal place for the cache is open for discussion. Perhaps the right place would actually be SELinux itself - it could just transparently cache the sid -> context string conversions like it already caches AVC decisions. But hopefully the cache won't even be needed in the end (see above).
The conversion from the secid token to the secctx string format shouldn't happen on every SO_PASSSEC request; the conversion should only occur when the secid token is not present in the sidtab cache.
(In reply to Paul Moore from comment #8)
> The conversion from the secid token to the secctx string format shouldn't
> happen on every SO_PASSSEC request; the conversion should only occur when
> the secid token is not present in the sidtab cache.
That's not true - sidtab only stores the 'struct context' corresponding to the SID, which only contains the full string if the context is invalid under current policy. For valid contexts the string is recomputed from scratch on each context_struct_to_string() call.
My apologies, I remembered things incorrectly.
In this case, I would rather we looked at caching the string generation somehow (there are a few different ways we could do this) than adding a bunch of tweaks to the MLS ebitmap code. In my opinion, if we are hitting the sid->secctx conversion often enough that it is a problem, it is most likely always going to be a bottleneck, and implementing/improving a conversion cache will likely be a much better fix long term.
Michal, I'm unable to reproduce the perf measurements on Fedora (29 & 30). It looks like stream sockets are being used instead of datagram sockets in systemd-journal - is there some additional configuration change needed to trigger the problem?
(In reply to Paul Moore from comment #10)
> In this case, I would rather we looked at caching the string generation
> somehow (there are a few different ways we could do this) then adding a
> bunch of tweaks to the the MLS ebitmap code. In my opinion, if we are
> hitting the sid->secctx conversion often enough that it is a problem, it is
> most likely always going to be a bottleneck and implementing/improving a
> conversion cache will likely be a much better fix long term.
Well, adding the cache is going to add some complexity and is likely going to take a lot more effort to implement. The ebitmap patch (I will attach it after this comment) adds only a little code to ebitmap.h and actually simplifies things nicely in the MLS functions. Whether or not we end up implementing the cache, it is IMHO worth applying on its own.
Created attachment 1593709 [details]
Attaching the ebitmap patch. May contain bugs :)
Scratch build for F30 with the patch applied in progress here:
We've gotten clever with the ebitmap code in the past and I've ended up regretting it. I think there is benefit in keeping the ebitmap code as clean (tweak-free) as possible and instead implementing a proper cache to deal with the poor lookup performance.
Also, if you haven't learned this yet, let me make it clear that saying something is "going to take a lot more effort" has very little impact on my opinion :)
(In reply to Ondrej Mosnacek from comment #11)
> Michal, I'm unable to reproduce the perf measurements on Fedora (29 & 30).
> It looks like stream sockets are being used instead of datagram sockets in
> systemd-journal - is there some additional configuration change needed to
> trigger the problem?
Ahh, I made a copy-paste error in the reproducer. I was comparing SOCK_STREAM vs. SOCK_DGRAM performance and evidently copied the wrong command here. Please replace systemd-cat with "logger" and then you should be able to observe the problem.
Once again, I am very sorry for causing confusion.
Please correct me if I didn't get this right, but aren't the two proposed approaches actually orthogonal? IOW, shouldn't we speed up how we compute the context as well as cache it?
Caching the lookup will effectively eliminate the need to add tweaks to the MLS ebitmap code, and should provide greater performance as well (it caches the entire secid->secctx translation, not just the MLS category field).
As I mentioned above, there is value in keeping the MLS/ebitmap code simple(ish), and the cache would appear to be the more "proper" solution anyway.
(In reply to Paul Moore from comment #16)
> Caching the lookup will effectively eliminate the need to add tweaks to the
> MLS ebitmap code, and should provide a greater performance as well (caches
> the entire secid->secctx translation, not just the MLS category field).
> As I mentioned above, there is value in keeping the MLS/ebitmap code
> simple(ish), and the cache would appear to be the more "proper" solution
I wouldn't want to overcomplicate the ebitmap code either, but I don't think the proposed patch does that. It merely adds a new helper, sticking closely to the logic and style of the existing code. There may be value in avoiding intrusive code changes in ebitmap.[hc], but not in avoiding changes that actually improve things.
Regardless of whether we implement the cache (and I do agree we should), pointlessly calling find_next_bit() 1024 times for what is one of the most common category sets (c0.c1023) is just ridiculous.
Comment on attachment 1593709 [details]
FWIW, I posted the patch properly to the list now:
We didn't originally keep a copy of the string in the sidtab for two reasons:
1) Memory cost - context strings can potentially be quite large and there can be many of them, keeping them all around forever could be problematic
2) Aliasing - there can be multiple string representations of the same security context due to user/role/type/sensitivity/category aliases, the compact category set notation (c0.c1023) versus enumerating them individually, differing category ordering, etc. Context strings have to be normalized (via security_context_to_sid() followed by security_sid_to_context(), or string_to_context_struct() followed by context_struct_to_string() internally) in order to be useful for comparisons.
When the raw/unmapped context support was added, we stored those because we cannot map them to a context struct since they aren't known to the loaded policy.
Presently the code presumes that if ->len or ->str are set then the context is unmapped, so if you start storing strings for them all, you'll need a separate field to indicate whether they are mapped.
Now that the patch has been posted upstream on the mailing list, I suggest we move the upstream related discussion there.
*********** MASS BUG UPDATE **************
We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.
Fedora 29 has now been rebased to 5.2.9-100.fc29. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
If you have moved on to Fedora 30, and are still experiencing this issue, please change the version to Fedora 30.
If you experience different issues, please open a new bug report for those.
This issue is present also in the upstream kernel, so moving the bug to rawhide.
I posted a patch that implements the SID -> string cache as the alternate approach suggested above. This patch reduces the overhead of security_secid_to_secctx() from 65% down to only 2% in the systemd-journald scenario provided in the description (the cache hit ratio is almost 100% here, since the same context is used repeatedly).
A scratch build of a recent rawhide kernel with an only cosmetically different patch applied is available at . Note that this build uses the release kernel configuration instead of the debug configuration (which is slower and is the default for rawhide kernels).
Created attachment 1629248 [details]
journald flamegraph - SELinux enabled with string cache (rawhide w/release config)
Attaching a flamegraph for the scenario with scratch build 38524793 installed.
A patch that resolves this issue is now staged in the selinux/next branch:
Unless something bad happens, it should become part of the 5.6 kernel release.
This should now be resolved in rawhide.