Bug 893751 - audit breaks containers
Product: Fedora
Classification: Fedora
Component: kernel
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: Paul Moore
QA Contact: Fedora Extras Quality Assurance
Depends On:
Blocks: 1070851
Reported: 2013-01-09 15:44 EST by Dean Hunter
Modified: 2015-09-02 15:28 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2015-09-02 15:28:42 EDT
Type: Bug
Description Dean Hunter 2013-01-09 15:44:33 EST
Description of problem:
While working my way through Lennart Poettering's series of articles on systemd, I had some problems spawning a namespace container. From Example 1 on http://0pointer.de/public/systemd-man/systemd-nspawn.html:

   # yum --releasever=17 --nogpgcheck --installroot ~/fedora-tree/ install yum passwd vim-minimal rootfiles systemd

raises two SELinux alerts:
   1.    The source process: /usr/sbin/groupadd
      Attempted this access: write
               On this file: /dev/null
   2.    The source process: /usr/sbin/useradd
      Attempted this access: write
               On this file: /dev/null

# systemd-nspawn -D ~/fedora-tree /usr/lib/systemd/systemd

causes an endless sequence of the following messages:
         Starting D-Bus System Message Bus...
[FAILED] Failed to start D-Bus System Message Bus.
See 'systemctl status dbus.service' for details.

Version-Release number of selected component (if applicable):
[root@server ~]# yum list installed yum
Loaded plugins: langpacks, presto, refresh-packagekit
Installed Packages
yum.noarch                         3.4.3-47.fc18                         @fedora
[root@server ~]# yum list installed systemd*
Loaded plugins: langpacks, presto, refresh-packagekit
Installed Packages
systemd.x86_64                      195-15.fc18                 @updates-testing
systemd-libs.x86_64                 195-15.fc18                 @updates-testing
systemd-sysv.x86_64                 195-15.fc18                 @updates-testing
[root@server ~]# 

How reproducible:
Comment 1 Lennart Poettering 2013-01-11 08:18:40 EST
Currently, the Linux audit layer is broken when it comes to containers. If CAP_AUDIT_WRITE/CAP_AUDIT_CONTROL is lacking, all kinds of software will abort, including dbus and PAM.

Auditing is not virtualized properly, hence granting CAP_AUDIT_WRITE/_CONTROL to a container is not a good idea, but because of the broken audit layer, withholding them causes dbus and PAM to fail. If writing to audit fails with EPERM, the audit code should just skip over it, not abort.

This is only broken on Fedora, not on Debian. Hence it works fine to run a Debian container on a Fedora host.

A hack around this is passing --capability=cap_audit_write,cap_audit_control to nspawn, which will allow the container to boot but the audit data it generates is useless.
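
Whether a process inside the container actually holds these capabilities can be read from the CapEff line of /proc/self/status. The following is a minimal sketch, assuming the bit numbers 29 (CAP_AUDIT_WRITE) and 30 (CAP_AUDIT_CONTROL) from linux/capability.h; the helper names are illustrative and not part of systemd or the audit tools.

```python
# Capability bit numbers as defined in linux/capability.h.
CAP_AUDIT_WRITE = 29
CAP_AUDIT_CONTROL = 30


def has_cap(cap_eff_hex: str, cap_bit: int) -> bool:
    """Return True if the capability bit is set in a CapEff hex mask."""
    return bool(int(cap_eff_hex, 16) & (1 << cap_bit))


def read_cap_eff(status_path: str = "/proc/self/status") -> str:
    """Return the hex CapEff mask of the current process (Linux only)."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return line.split()[1]
    raise RuntimeError("CapEff line not found")
```

On a Linux host, has_cap(read_cap_eff(), CAP_AUDIT_WRITE) reports whether the calling process could write audit records at all.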

Reassigning to audit, since there's nothing to fix here in systemd.
Comment 2 Richard Wall 2013-01-24 16:01:01 EST
I just encountered this problem too.

It would have been helpful if there had been a link to this ticket on the 0pointer man page for nspawn.

 * http://0pointer.de/public/systemd-man/systemd-nspawn.html

Also worth noting that on Fedora 17, systemd-nspawn doesn't have the --capability option, so there's no workaround.
Comment 3 Steve Grubb 2013-01-28 11:26:41 EST
The problem here is one of container design. There are two ways to look at this.

1) If containers are viewed as process hardening, then do not put anything into them that requires auditing. The goal was separating the process from others.

2) If the containers are viewed as a light VM, then they are required to also have an audit daemon, collect certain properties of the VM, collect certain internal events, and forward events out of the VM to a permanent collector which the audit daemon can provide. Libvirt is also plumbed for all these requirements and it should be used for this purpose.

The audit code cannot skip over an EPERM return code. The requirements of the audit system are that if an event cannot be recorded, then the event must be stopped from occurring.
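
The two positions on EPERM taken so far in this thread can be contrasted in a short sketch. This is only an illustration of the policies, not the actual libaudit or PAM code; send_audit and action are hypothetical callables standing in for the real audit write and the real operation (for example a login).

```python
import errno


def audit_then_act(send_audit, action, fail_open: bool):
    """Run action() only if the audit record could be written.

    With fail_open=True, an EPERM from the audit write is skipped over
    (the behaviour requested in comment 1). With fail_open=False, the
    error propagates and the action never runs (the requirement stated
    in comment 3).
    """
    try:
        send_audit()
    except OSError as exc:
        if exc.errno == errno.EPERM and fail_open:
            # No permission to audit: carry on without a record.
            return action()
        # No audit record, so the event must not occur.
        raise
    return action()
```

The disagreement in the thread is about which value of fail_open is acceptable when the missing permission is a consequence of running in a container.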
Comment 4 Lennart Poettering 2013-01-28 23:10:14 EST
Well, putting Fedora in an OS container means the normal codepaths for Fedora userspace audit are used, and since the current userspace audit code is not capable of understanding that the lack of CAP_AUDIT_WRITE/CAP_AUDIT_CONTROL means that audit is not available, you currently cannot boot up a container with Fedora -- unless you grant it CAP_AUDIT_WRITE/CAP_AUDIT_CONTROL, at which time the container suddenly can muck around with the host's audit controls, which is a huge security problem...

Anyway, since this is all so broken and I don't really care about audit, I have now begun to document everywhere that people who want to use OS containers should just turn off audit with audit=0.

Of course that means that Fedora/RHEL won't support OS containers without altering the kernel command line, but I guess that's not really my problem...
Comment 5 Daniel Walsh 2013-02-06 09:50:57 EST
Why wouldn't we allow CAP_AUDIT_WRITE/CAP_AUDIT_CONTROL and have the kernel add something to the audit record to indicate that this came from a different namespace?  Then people could filter messages within the audit.log or even do stuff like having an audit dispatcher that would forward audit messages to a log which exists within the container.

Turning off audit in order to run containers, seems nuts.
Comment 6 Steve Grubb 2013-02-06 10:15:54 EST
Dan, you are right. Turning off audit when it's _required_ is not the solution. As to the other statement, there is work ongoing with kernel people to add a field to the event. But we also _have_ to be able to correctly identify the creation of the container and how it differs from its parent process.
Comment 7 Lennart Poettering 2013-03-07 20:57:42 EST
Well, auditing is still broken in four ways in containers:

a) audit messages from the container are not distinguishable from the host's messages

b) the container can muck with audit rules of the host.

c) When I open a new PID namespace the sessionid/loginuid is not reset even though I just opened an entirely new container. The loginuid leaked from the host makes no sense at all of course in the container, and confuses the hell out of pam_loginuid, and the audit tools

d) There's no way how we could turn off auditing in a container, but leave it on on the host.
Comment 8 Daniel Walsh 2013-03-08 14:10:05 EST
A: This is being worked on with the kernel.  Including initpid in the audit message, any initpid != 1 would have come from a container.
Also, if there are proper audit messages on start and stop (as libvirt is doing), you should be able to gather all the audit messages for a particular container based on the initpid.

B: is currently being blocked if you use SELinux.

c:  I believe this should be fixed.  Changing the pid namespace should reset loginuid and sessionid to -1.

d: Don't know the answer for this, or if it is something a user would want.
Comment 9 Lennart Poettering 2013-05-09 09:56:17 EDT
Any update on this one? Any chance we get A and C fixed for F19 at least?
Comment 10 Aristeu Rozanski 2013-05-09 10:16:25 EDT
We still need to hear from Steve Grubb of the requirements. Without comment #3
answered, we can't implement anything. I sent a draft to allow having a single
auditd and recently Gao Feng submitted a patchset to implement an auditd per
container tied to userns.
Comment 11 Daniel Berrange 2013-05-09 10:31:03 EDT
See also this mail thread on the problems with PAM + LXC http://lists.freedesktop.org/archives/systemd-devel/2013-May/010944.html
Comment 12 Eric Paris 2013-05-09 12:58:41 EDT
a) there is an upstream patch posting from aris, but needs more work

b) you shouldn't need to give CAP_AUDIT_CONTROL inside the container

c) not gunna happen.  If you launch the container by hand, you are going to have to change pam to remove pam_loginuid.  but things should 'just work' if you launch from systemd or libvirt from systemd etc...

d) not something we want

e) the kernel still rejects with EPERM any message if pid_ns != init_pid_ns.  It's a bug we need to fix in kernel.
Comment 14 Lennart Poettering 2013-05-09 17:26:56 EDT
(In reply to comment #12)
> a) there is an upstream patch posting from aris, but needs more work
> b) you shouldn't need to give CAP_AUDIT_CONTROL inside the container

Well, iirc writing to loginuid requires this...

> c) not gunna happen.  If you launch the container by hand, you are going to
> have to change pam to remove pam_loginuid.  but things should 'just work' if
> you launch from systemd or libvirt from systemd etc...

Wow, that's just sad. I guess I'll then make it more prominent in the nspawn docs that auditing needs to be disabled for nspawn to work correctly.

nspawn is almost always started from a shell, so you basically break nspawn entirely with this. Heck, for testing purposes Daniel also runs libvirt-lxc from a shell, so you make his life really hard too.

I was trying to get management to make running RHEL in containers cleanly without any manual intervention a release goal. I guess I can forget this now if audit stays broken like this.

But well, if this is how it is then I'll instead just document everywhere that auditing breaks things...

You know, before the sealing off was enabled in the kernel we could still reset loginuid right after setting up the namespace, but now even that's gone...
Comment 15 Lennart Poettering 2013-05-09 18:23:11 EDT
I have committed this now to systemd:
http://cgit.freedesktop.org/systemd/systemd/commit/?id=7ecec4705c0cacb1446af0eb7a4aee66c00d058f

This should be helpful to the user and help him around this usability nightmare...

This suggests audit=0 on the man page, in the README and when you run nspawn on a kernel where audit is enabled. It's certainly better to complain loudly than to just allow the user spawn containers he can't log into because audit can't handle this.
Comment 16 Lennart Poettering 2013-05-09 18:25:13 EDT
(In reply to comment #12)

> c) not gunna happen.  If you launch the container by hand, you are going to
> have to change pam to remove pam_loginuid.  but things should 'just work' if
> you launch from systemd or libvirt from systemd etc...

BTW, to make this very clear: with systemd we support booting the same OS image on bare metal, on VMs and in containers without *any* alteration, and it needs to work the same way in all three cases. That's why we are not OK with asking the user to patch around in PAM files or anything like that. That's simply not an option for us.
Comment 17 Steve Grubb 2013-05-09 19:24:54 EDT
Is it not possible to write the utility to start the container via systemd? Is starting it directly by clone the only possible way to write this software?

To be very clear, the loginuid has to be as tamper-proof as possible. If we open a hole for containers to reset the loginuid, then we also open a hole for abuse by people avoiding detection.
Comment 18 Steve Grubb 2013-05-09 19:43:48 EDT
Btw, it seems like an incongruity to start daemons from a clean environment like systemd but not doing the same thing for a container. :-)
Comment 19 Lennart Poettering 2013-05-10 08:23:05 EDT
Well, Steve, in the container the loginuid of the host makes no sense at all.

If I read /proc/self/loginuid of any of the processes in the container it returns the UID of the host, which doesn't even make any sense... This is so broken, it hurts.
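
What a process sees in /proc/self/loginuid can be inspected with a few lines of Python. This is a minimal sketch; it assumes the kernel reports an unset loginuid as 4294967295, the unsigned 32-bit form of -1, and the function names are illustrative.

```python
from typing import Optional

# An unset loginuid is reported as (uint32_t)-1.
LOGINUID_UNSET = 2**32 - 1


def parse_loginuid(text: str) -> Optional[int]:
    """Return the login UID, or None if the loginuid is unset."""
    value = int(text.strip())
    return None if value == LOGINUID_UNSET else value


def read_loginuid(path: str = "/proc/self/loginuid") -> Optional[int]:
    """Read and parse the loginuid of the calling process (Linux only)."""
    with open(path) as loginuid_file:
        return parse_loginuid(loginuid_file.read())
```

Run inside a container that was started from a login shell, read_loginuid() returning a host UID rather than None is the leak described above.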
Comment 20 Steve Grubb 2013-05-10 08:34:49 EDT
That's a good reason why the container should be started via systemd even if it looks like a shell command. You would also have the container inheriting environmental variables, process group, session membership, supplementary group IDs, alarms, umask, process signal mask, pending signals, rlimits, etc.

Starting from systemd would probably give you a more reliable startup. It would also prevent the problem of loginuid bleeding over.
Comment 21 Eric Paris 2013-05-13 10:42:10 EDT
(In reply to comment #15)
> I have committed this now to systemd:
> http://cgit.freedesktop.org/systemd/systemd/commit/
> ?id=7ecec4705c0cacb1446af0eb7a4aee66c00d058f

Wrong commit id?  Seems unrelated.

> This should be helpful to the user and help him around this usability
> nightmare...
> This suggests audit=0 on the man page, in the README and when you run nspawn
> on a kernel where audit is enabled. It's certainly better to complain loudly
> than to just allow the user spawn containers he can't log into because audit
> can't handle this.

Lennart, think about what is going on.  Audit lives outside the container.  That might change, fujitsu is actually working on that, but for now, auditd is a host thing and is only accessible to the real host.  So we need to look at this as a global security thingie.  Globally, if the namespaces and cgroups are set up by a logged in user, the loginuid of all of those children processes IS that admin which started things.  You shouldn't be allowed to change it.  Who the hell knows what that admin set up?  We have no way to track or record that information.

Now if nspawn actually was just a wrapper to kick back through systemd, it would be systemd which launched the 'container.'  Now we have systemd which can track, record, and make sure things were set up intelligently.  The loginuid wouldn't be set.  So now you don't need CAP_AUDIT_CONTROL inside the 'container'.

If you launch a 'container' using an intelligent tool, like systemd or libvirt, we are in violent agreement.  Things should just work.  If you hack shit up by hand you get to hack shit up by hand until it works.  How is that a usability nightmare?  audit=0 being needed is absolutely wrong.  Please let's fix the workflow and remove such silliness as you describe about disabling audit...

As to your point that /proc/self/loginuid does not make sense inside the container, I agree.  I'll look into making that output local to the reader's namespace.  Should be a pretty easy patch.  Can someone open a BZ?
Comment 22 Michal Schmidt 2013-05-13 11:00:51 EDT
(In reply to comment #21)
> Wrong commit id?  Seems unrelated.

Yes. This is the commit Lennart meant:
Comment 23 Eric Paris 2013-05-13 11:11:22 EDT
Thanks.  Yeah, those comments are just absolutely wrong.  If audit and systemd containers don't just work, something is wrong.  If audit + containers by hand don't just work, you get to hold the pieces.  Lennart, can we please fix the comments?  If you'd like to discuss what I'm thinking and why, let me know; I feel like there must be some misunderstanding between us right now...
Comment 24 Lennart Poettering 2013-05-13 11:27:25 EDT
nspawn is primarily a tool for admins and developers that just allows you to quickly boot up a machine from the command line, that's it. What you are asking me to do means basically turning nspawn into another libvirtd (which makes no sense, we already have that, in libvirtd...).

So nspawn is precisely about being able to quickly run from a command line, and you guys actively make that impossible via audit. 

You know, I first tried to get PAM fixed to simply skip audit stuff when the audit caps are missing, so that we can simply drop these caps from a container and everything would work fine. But no, this got blocked by Steve. Steve's just too married to the idea that audit should actively break things, if possible...

Then, my next attempt was to get loginuid to be reset for containers by the kernel, implicitly (what this bug is about). But this was blocked.

My next idea was to reset it manually, but that's blocked too, since the loginuid is sealed now in the kernel...

This has been going on for a year now or so. With the systemd commit I made I simply tried to improve things for the user, since it's documented now why these things fail, and how to work around them. Given that nspawn is a tool for admins/developers, I think it's the best thing to do for now.

You know, I don't really care whether the kernel resets loginuid entirely when a container is opened, or whether it only hides the field from userspace in the container, or whether the userspace audit code simply ignores this data if it runs from a container, but the status quo of simply exposing loginuid in the container, and having the audit userspace naively believe that data is just broken, and that's a fact. And I wished you guys could see that...
Comment 25 Lennart Poettering 2013-05-13 11:34:45 EDT
Some other wonderful hacks I came up with while trying to work-around the broken audit stack with containers are these:

- Add another PAM module that runs before pam_loginuid, and when it detects that it is run in a container simply mounts a file from /tmp to /proc/self/loginuid. That should be enough to trick pam_loginuid to not fail. This one gets extra points for being ugly...

- Use seccomp to actually make socket(AF_NETLINK, SOCK_RAW, NETLINK_AUDIT) return -ENOTSUP or so (or whatever the right error code is) in the container, so that pam_loginuid is tricked into believing audit is off in userspace... Also ugly, but less so... 

A combination of both more or less tricks all audit userspace into thinking no audit kernel support is available, and should make things work...
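
The second hack relies on how userspace decides whether audit is available: it tries to open the NETLINK_AUDIT socket and looks at the error. A rough sketch of such a probe follows. NETLINK_AUDIT = 9 comes from linux/netlink.h; the set of error codes treated as "no audit support" is an assumption of this sketch, which is exactly the open question in the comment above, and the function names are illustrative.

```python
import errno
import socket

NETLINK_AUDIT = 9  # from linux/netlink.h

# Assumption: errors that audit userspace would read as "the kernel has
# no audit support", as opposed to "audit exists but access was denied".
UNAVAILABLE_ERRNOS = {errno.EINVAL, errno.EPROTONOSUPPORT, errno.EAFNOSUPPORT}


def classify_audit_errno(err) -> str:
    """Map an errno from socket() to a coarse availability verdict."""
    if err in UNAVAILABLE_ERRNOS:
        return "unavailable"
    if err in (errno.EPERM, errno.EACCES):
        return "denied"
    return "error"


def probe_audit_socket() -> str:
    """Try to open the audit netlink socket and report what happened."""
    family = getattr(socket, "AF_NETLINK", None)
    if family is None:
        return "unavailable"  # Python build without netlink (not Linux)
    try:
        sock = socket.socket(family, socket.SOCK_RAW, NETLINK_AUDIT)
    except OSError as exc:
        return classify_audit_errno(exc.errno)
    sock.close()
    return "available"
```

A seccomp filter of the kind described would aim to make this probe land in the "unavailable" branch inside the container, rather than in "denied".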
Comment 26 Eric Paris 2013-05-13 11:45:32 EDT
Note that kernel auditing is broken when used with systemd's
container code. When using systemd in conjunction with
containers please make sure to either turn off auditing at
runtime using the kernel command line option "audit=0", or
turn it off at kernel compile time using:


This is just wrong.  audit + systemd + container should work just fine.  It is audit + container WITHOUT systemd which is not working.


maybe a fix would be to allow nspawn, with CAP_AUDIT_CONTROL, to unset the loginuid.  It also means that we don't have to leak CAP_AUDIT_CONTROL into the container.  Only the setup programs need it.

steve, do you think we can craft security goals around that solution?  it would mean that a container, launched by nspawn (or maybe virt-sandbox) would be tracked as if they were launched by systemd or libvirt...
Comment 27 Daniel Walsh 2013-05-15 13:12:58 EDT
I have been thinking about this.  Steve wants loginuid to be immutable in order to make sure no user could change his loginuid to do something bad that the audit system cannot track.  But Steve, you acknowledge that turning off audit does the same thing, and that you cannot stop an evil admin.  And to go along with this, the audit subsystem would be able to log the fact that said evil admin has disabled audit.  Take him out to the woodshed and beat him.

Why not just make changing an existing (non -1) loginuid an auditable event.  Then if evil admin changes his loginuid you can also take him to the woodshed.

The current setup with pam_loginuid refusing to allow users to login if the loginuid is set will break anyone debugging a login program like sshd.

gdb /usr/bin/sssd
> r

Switch to another window attempt ssh localhost; Fail loginuid is set.
Comment 28 Eric Paris 2013-05-15 13:22:34 EDT
steve and I discussed this some more.  I have now written a patch to allow a utility to UNSET its loginuid if it has cap_audit_control AND the audit failure mechanism is not set to panic.  In an environment where people want the system to panic rather than lose an audit log, the absolutely immutable nature of things might have some use.  In normal environments we have the attack dan is describing and so unsetting the loginuid BEFORE the actual authentication comes along trying to set it to something else makes sense.

I haven't tested, but nspawn should be able to just echo "-1" > /proc/self/loginuid before it launches the container and everyone should be happy(ish)
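
The same idea, written as a small launcher, might look like the sketch below. It is untested against the patch mentioned here and rests on assumptions: that the caller holds CAP_AUDIT_CONTROL, that the kernel accepts the write, and that the unset value is written as 4294967295 (the unsigned 32-bit form of -1) rather than the literal "-1". The function names are hypothetical.

```python
import os

# Assumption: (uint32_t)-1 in decimal is what the kernel treats as "unset".
LOGINUID_UNSET = "4294967295"


def unset_loginuid(path: str = "/proc/self/loginuid") -> None:
    """Write the 'unset' value to the loginuid file.

    Raises OSError (for example EPERM) if the kernel refuses the write.
    """
    with open(path, "w") as loginuid_file:
        loginuid_file.write(LOGINUID_UNSET)


def launch_with_clean_loginuid(argv, path: str = "/proc/self/loginuid"):
    """Unset the loginuid, then replace this process with argv."""
    unset_loginuid(path)
    os.execvp(argv[0], argv)
```

Only the launcher needs the capability in this arrangement; the payload started by os.execvp inherits the unset loginuid and can have it set once by its own login stack.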
Comment 29 Steve Grubb 2013-05-15 13:57:13 EDT
That's close to what I was thinking but a detail needs clarifying. Right now we have a compile time choice between the old way of requiring CAP_AUDIT_CONTROL and the new way that does not require capabilities, but it's immutable. What I was trying to say in that discussion is how about we make it a runtime choice instead. We could coopt the audit= boot parameter so that we can choose which method at boot.

I was thinking that we can make it audit=2, which means audit enabled and loginuid are immutable. But Eric reminded me that we can already set -e 2 via auditctl and that means AUDIT_LOCKED. However, at boot it does not make sense to say make the rules immutable.

So there are 2 approaches... either map audit=2 into a new variable during audit_init so that we have a flag to see if loginuid is immutable. Or, maybe it could be bit mapped where 1 = enabled, 2 = rules immutable, 4 = loginuid immutable... except we currently make 2 imply that 1 is set. I really don't care which way. The main point is making a runtime choice instead of compile time choice between the two currently existing methods. But it should not be tied to the audit failure mechanism. Thanks.
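
The bit-mapped variant can be sketched as follows. This only illustrates the proposal in this comment; it does not describe how the kernel parses audit=, and the names AuditBoot and decode_audit_param are made up for the example.

```python
from enum import IntFlag


class AuditBoot(IntFlag):
    """Hypothetical bit layout for the audit= boot parameter."""
    ENABLED = 1
    RULES_IMMUTABLE = 2
    LOGINUID_IMMUTABLE = 4


def decode_audit_param(value: int) -> AuditBoot:
    """Decode the bitmap, keeping the rule that 2 implies 1."""
    flags = AuditBoot(value)
    if flags & AuditBoot.RULES_IMMUTABLE:
        # Immutable rules only make sense with auditing enabled.
        flags |= AuditBoot.ENABLED
    return flags
```

The sketch applies the "2 implies 1" rule only to the rules bit; whether the loginuid bit should also imply enabled is left open above.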
Comment 30 Steve Grubb 2013-05-15 13:59:32 EDT
And one last thought, the runtime choice should be immutable, meaning it can only be chosen at boot and never switched for any reason without a reboot.
Comment 31 Eric Paris 2013-05-15 15:35:44 EDT
kernel command line flags are getting more and more highly discouraged in the community, especially something as difficult to locate/google/find as a bitmap.  I'd be agreeable to a new auditctl command/message...

maybe a SETOPTIONS bitmap?  with bits for setting and locking?  we can make loginuid settable and lockable?
Comment 32 Steve Grubb 2013-05-15 16:07:15 EDT
Doing it by auditctl is too late - not to mention I don't want to make people think that they can change it anytime they want. It really needs to be at boot time.
Comment 33 Eric Paris 2013-05-15 16:26:47 EDT
Wait, if we don't trust auditctl during bootup (aka, this is audit policy and should go in audit.rules) we are in a WHOLE lot more trouble than loginuids.

I suggest we follow the prctl SECBIT_* and SECBIT_*_LOCKED as discussed in the capabilities man page.  In the CC config you'd set both the loginuid_immutable and the loginuid_immutable_locked bits.

It is similar to -e2, but you can have only 1 bit of a flags/features register locked at a time.  I'd think this entire register should be locked by -e2
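
The set-bit-plus-lock-bit pattern borrowed from SECBIT_* can be modelled in a few lines. This is a toy model of the design pattern only; the class, the bit number, and lock_all (standing in for the effect of -e2) are hypothetical and do not correspond to an existing audit interface.

```python
LOGINUID_IMMUTABLE = 0  # hypothetical feature bit number


class FeatureRegister:
    """A features bitmap where each bit has a matching lock bit."""

    def __init__(self) -> None:
        self.features = 0
        self.locked = 0

    def set_feature(self, bit: int, enabled: bool) -> None:
        mask = 1 << bit
        if self.locked & mask:
            raise PermissionError("feature bit %d is locked" % bit)
        if enabled:
            self.features |= mask
        else:
            self.features &= ~mask

    def lock(self, bit: int) -> None:
        """Freeze one bit at its current value until reboot."""
        self.locked |= 1 << bit

    def lock_all(self) -> None:
        """Freeze the whole register, as -e2 might do."""
        self.locked = ~0

    def is_set(self, bit: int) -> bool:
        return bool(self.features & (1 << bit))
```

A locked-down configuration would set and then lock LOGINUID_IMMUTABLE from audit.rules, while other configurations could leave the bit clear and unlocked.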
Comment 34 Steve Grubb 2013-05-15 16:34:46 EDT
It's not a question of trust. It can simply be too late. Prctl is just wrong. That means a process can have a different security policy than the system. We don't want that kind of inconsistency in user/subj binding. It should be all one way or all another way. Doing audit=2 will be easy to test for via SCAP. This will be something that needs to be tested for in security check lists.
Comment 35 Eric Paris 2013-05-15 16:52:24 EDT
You haven't described why setting this while parsing audit.rules is too late.  From my point of view this is audit configuration.  It belongs in audit.rules.  Can you please describe the problem, not a solution?

I wasn't saying to use prctl.  sorry for the misunderstanding.  It was just an example of the design pattern I was hoping to explain.  Please ignore prctl and any notions you have about it.

I believe the best way to handle this would be a new interface.  The option to turn this on should live in audit.rules.  Just like every other audit option.  It also means that your scanner should be able to look for it.  If the scanner can't handle the audit.rules you are already in a world of additional pain.  agreed?

It's possible we could use an AUDIT_SET message with a couple of new mask values


Although it seems that the kernel code around struct audit_status is so poorly written it's difficult to grow it to support more bits.  Not impossible, but the second time I see binary structures in the kernel<->userspace audit code that weren't well designed.

So maybe a whole new message type AUDIT_SET_FEATURES or AUDIT_GET_FEATURES makes more sense...
Comment 36 Fedora End Of Life 2013-12-21 05:17:18 EST
This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue and we are sorry that we may not be 
able to fix it before Fedora 18 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.
Comment 37 Lennart Poettering 2013-12-22 09:37:13 EST
Sooo, any update to this? Can we get a fix for this, maybe as a christmas present?
Comment 38 Timothée Ravier 2014-01-18 20:54:01 EST
As far as I understand, the following commits in 3.13 will be part of the solution:

They allow processes with CAP_AUDIT_CONTROL to unset the loginuid at will. Thus systemd-nspawn/lxc/... could get this capability and unset the loginuid right before launching the container.
Comment 39 Eric Paris 2014-03-20 10:54:36 EDT
This got us 99% of the way there.  I think at least.  Small brain.  The remaining problem was that one of the pam modules, heck if I remember which one, sent a userspace audit message.  Since the message came from a non-init pid namespace, the kernel was rejecting the message (actually it was rejecting opening the socket as I recall)

The 3.15 kernel will allow userspace messages from the non-init pid namespace.  So that last part should be taken care of in 3.15...
Comment 40 Daniel Walsh 2014-03-22 06:31:38 EDT
Can we get these fixes into RHEL7?  We have had occurrences where people are trying to run sshd within docker and failing.
Comment 41 Marcel Wysocki 2014-04-25 13:28:23 EDT
+1 it would be great to get this into RHEL7
Comment 42 Paul Moore 2014-11-06 21:20:27 EST
Let's bring this back to life for a moment in an attempt to reach some sort of conclusion ...

Upstream is fine starting with 3.15, yes?

RHEL7 is broken and likely needs the 3.13 commits (see comment #38) and a few from 3.15, yes?
Comment 43 Richard Guy Briggs 2014-11-12 17:51:01 EST
I believe this was inadvertently fixed by a backported patch to fix 1010455 (947530 needed an extra step):
Comment 44 Paul Moore 2015-09-02 15:28:42 EDT
It appears that upstream/Fedora are resolved and RHEL7 has already been fixed via other BZs; let's close this out as CLOSED/UPSTREAM.
