Bug 1769299 - systemd: systemd-nspawn disables pkey_alloc system call by default
Summary: systemd: systemd-nspawn disables pkey_alloc system call by default
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: systemd
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.0
Assignee: systemd-maint
QA Contact: Frantisek Sumsal
URL:
Whiteboard:
Depends On: 1770154
Blocks:
 
Reported: 2019-11-06 10:47 UTC by Florian Weimer
Modified: 2021-05-06 07:30 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-06 07:30:24 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description Florian Weimer 2019-11-06 10:47:05 UTC
It seems that systemd-nspawn disables the pkey_alloc system call by default via seccomp, causing it to fail with EPERM. Since this system call is used for userspace security hardening, filtering it is likely to reduce security, rather than increasing it.

Observed with: systemd-container-239-19.el8.x86_64
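
A minimal sketch of a probe for the behavior described above, assuming a Linux system whose C library exports the `pkey_alloc` and `pkey_free` wrappers (recent glibc versions do). The helper name `probe_pkey_alloc` is made up for illustration and is not part of the original report. It returns the errno name instead of assuming a particular outcome, since the result depends on the CPU, the kernel, and any seccomp filter in effect.

```python
import ctypes
import errno


def probe_pkey_alloc():
    """Try pkey_alloc(0, 0) and return "ok", an errno name, or "unavailable"."""
    libc = ctypes.CDLL(None, use_errno=True)
    # Look the wrappers up defensively: not every C library provides them.
    alloc = getattr(libc, "pkey_alloc", None)
    free = getattr(libc, "pkey_free", None)
    if alloc is None or free is None:
        return "unavailable"

    # int pkey_alloc(unsigned int flags, unsigned int access_rights);
    alloc.argtypes = [ctypes.c_uint, ctypes.c_uint]
    alloc.restype = ctypes.c_int
    # int pkey_free(int pkey);
    free.argtypes = [ctypes.c_int]
    free.restype = ctypes.c_int

    ctypes.set_errno(0)
    key = alloc(0, 0)
    if key >= 0:
        free(key)  # release the key again; this is only a probe
        return "ok"
    code = ctypes.get_errno()
    return errno.errorcode.get(code, str(code))


if __name__ == "__main__":
    print("pkey_alloc:", probe_pkey_alloc())
```

Going by the description, running this inside a default systemd-nspawn container would be expected to report EPERM. Outside a filter, other results (for example ok, ENOSPC, or ENOSYS) are possible depending on hardware and kernel support.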

Comment 1 Lennart Poettering 2019-11-08 11:29:34 UTC
what is pkey_alloc() used by? is this something any common lib implicitly invokes? glibc?

Comment 2 Lennart Poettering 2019-11-08 11:37:13 UTC
hmm, judging from https://codesearch.debian.net/search?q=pkey_alloc&literal=1&perpkg=1 no one is using pkey actually? if no one is, then it's probably little tested, hence there's a good chance it has issues.

Comment 3 Florian Weimer 2019-11-08 11:48:36 UTC
It's supposed to be used by in-memory databases in combination with DAX, to avoid corrupting persistent storage accidentally, and perhaps by certain cryptographic modules, to protect key material.

I don't think there is much open-source software using it.

Comment 4 Lennart Poettering 2019-11-08 12:38:31 UTC
but isn't this something that would-be users should enable explicitly with --system-call-filter=@pkey rather than enable for everyone by default?

i.e. the whole syscall filter thing is an exercise in minimizing attack surface, and something little used like this probably tips the balance more to the side of "opt-in" rather than "opt-out".

Out of curiosity: how did you run into this btw, where did you notice this?

(btw, afaics docker blocks it too: https://docs.docker.com/engine/security/seccomp/)

Comment 5 Lennart Poettering 2019-11-08 12:40:34 UTC
btw, none of the CPUs I have come across support pkeys (they lack the "pku" flag in /proc/cpuinfo at least). Is this available in any current Intel CPUs?
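
The check mentioned here can be sketched as follows, assuming the x86 layout of /proc/cpuinfo in which CPU features appear on lines whose field name is "flags". The function names `cpu_has_flag` and `host_has_pku` are hypothetical helpers, not an existing API.

```python
def cpu_has_flag(cpuinfo_text, flag="pku"):
    """Return True if any 'flags' line in cpuinfo_text lists the given flag."""
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        # Only the "flags" field counts; other fields may contain the same word.
        if sep and name.strip() == "flags" and flag in value.split():
            return True
    return False


def host_has_pku(path="/proc/cpuinfo"):
    """Check the running machine; returns False if the file cannot be read."""
    try:
        with open(path, encoding="ascii", errors="replace") as f:
            return cpu_has_flag(f.read())
    except OSError:
        return False


if __name__ == "__main__":
    print("pku flag present:", host_has_pku())
```

A missing flag only says the CPU (or kernel) does not expose protection keys; it says nothing about whether a seccomp filter blocks the system call.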

Comment 6 Florian Weimer 2019-11-08 12:54:14 UTC
(In reply to Lennart Poettering from comment #4)
> but isn't this something that would-be users should enable explicitly with
> --system-call-filter=@pkey rather than enable for everyone by default?

I don't think users should be aware of individual system calls, it should just work. I'm not sure why systemd-nspawn tries to arbitrarily block system calls by default. Reducing attack surface is one thing, but breaking the userspace ABI is not something that users will expect.

> Out of curiosity: how did you run into this btw, where did you notice this?

It shows up when running the glibc test suite.

(In reply to Lennart Poettering from comment #5)
> btw, none of the CPUs I have come across support pkeys (they lack the "pku" flag in
> /proc/cpuinfo at least). Is this available in any current Intel CPUs?

Some Xeon Scalable Processors have support, but not all Skylake server processors.

The EPERM vs ENOSYS difference causes failures even without pkeys support in the CPU.

Comment 7 Lennart Poettering 2019-11-08 15:02:54 UTC
> The EPERM vs ENOSYS difference causes failures even without pkeys support in the CPU.

So I'd claim we are right with returning EPERM here. I mean, this is a security profile, and EPERM sounds like the more appropriate error for that. ENOSYS sounds like the error to return for "not implemented", but in this case it might very well be implemented, but it's forbidden due to the selected policy. Or, to put this differently: the syscall policies nspawn enforces are more like SELinux policies that prohibit access to APIs and objects, which also use EPERM, not ENOSYS. Yes, you can use seccomp for anything, but philosophically these policies are really about security, not about hiding functionality, and I don't think we should lie about that; it just makes stuff hard to debug.

In systemd's own codebase, when we use new fancy syscalls we generally assume EPERM could also mean "security policy doesn't allow this", and then implement a fallback to something else, much the same as for ENOSYS.
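
The convention described in this comment could look roughly like the sketch below. It is written in Python for brevity and is not taken from systemd's code; `call_with_fallback`, `FALLBACK_ERRNOS`, and the stand-in callables are hypothetical names used only for illustration.

```python
import errno

# Errors treated as "use the older interface instead". Including EPERM is the
# convention this comment describes; a stricter caller would list ENOSYS only.
FALLBACK_ERRNOS = frozenset({errno.ENOSYS, errno.EPERM})


def call_with_fallback(new_call, old_call, fallback_errnos=FALLBACK_ERRNOS):
    """Run new_call(); on a fallback-worthy OSError, run old_call() instead."""
    try:
        return new_call()
    except OSError as exc:
        if exc.errno in fallback_errnos:
            return old_call()
        raise  # any other error is a real failure


def blocked_by_policy():
    """Stand-in for a new syscall that a seccomp filter rejects with EPERM."""
    raise OSError(errno.EPERM, "Operation not permitted")


def legacy_path():
    """Stand-in for the older interface."""
    return "legacy path"


if __name__ == "__main__":
    print(call_with_fallback(blocked_by_policy, legacy_path))
```

Passing `fallback_errnos={errno.ENOSYS}` gives the stricter behavior, where an EPERM from the filter propagates to the caller as an error instead of triggering the fallback.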

btw, docker's seccomp policies also return EPERM for blocked calls, exactly like we do.

Comment 8 Florian Weimer 2019-11-08 15:18:45 UTC
(In reply to Lennart Poettering from comment #7)
> > The EPERM vs ENOSYS difference causes failures even without pkeys support in the CPU.
> 
> So I'd claim we are right with returning EPERM here. I mean, this is a
> security profile, and EPERM sounds like the more appropriate error for that.
> ENOSYS sounds like the error to return for "not implemented", but in this
> case it might very well be implemented, but it's forbidden due to the
> selected policy.

They serve different purposes. EPERM is appropriate if you want things to fail (so that applications break), ENOSYS is appropriate if you want to trigger fallback (like utimensat_time64 → utime) or just disable the feature (because the application assumes the kernel is too old to support it). For a generic container runtime, there either have to be no filters by default (my preference), or filters for unknown system calls need to return ENOSYS. Everything else will break too many applications.

If you have specific knowledge of the system call, you can return EPERM instead in a few cases (e.g. for clock_settime). But that's not really possible for an unknown system call.
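
The rule proposed in this comment can be written down as a small decision function. This is only a sketch of the policy as stated here, not of how systemd-nspawn or any other runtime builds its filter; `KNOWN_DENIED` and `errno_for_blocked_syscall` are hypothetical names, and the set contains just the one example given above.

```python
import errno

# Hypothetical example set: calls the filter author understands and
# deliberately forbids. Only clock_settime is named in the comment.
KNOWN_DENIED = frozenset({"clock_settime"})


def errno_for_blocked_syscall(name, known_denied=KNOWN_DENIED):
    """Pick the errno a filter would return for a blocked system call."""
    if name in known_denied:
        # Understood and forbidden: report a policy denial.
        return errno.EPERM
    # Unknown to the filter: behave like a kernel that predates the call,
    # so that applications take their existing fallback path.
    return errno.ENOSYS


if __name__ == "__main__":
    for syscall in ("clock_settime", "some_future_syscall"):
        code = errno_for_blocked_syscall(syscall)
        print(syscall, "->", errno.errorcode[code])
```

The design choice is that EPERM is reserved for calls where the filter author knows a hard failure is acceptable, while everything else stays indistinguishable from an older kernel.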

Comment 11 RHEL Program Management 2021-05-06 07:30:24 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

