Bug 205161

Summary: 11.5x performance degradation with SELinux on ext3
Product: Fedora
Component: kernel
Reporter: Bill Nottingham <notting>
Assignee: Eric Paris <eparis>
Status: CLOSED CURRENTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 9
CC: eparis, esandeen, jmorris, kernel-maint, k.georgiou, rvokal, sct, sdsmall, sgrubb, tmraz, triage, wtogami
Hardware: All
OS: Linux
Whiteboard: bzcl34nup
Fixed In Version: F9
Doc Type: Bug Fix
Last Closed: 2008-06-04 19:40:12 UTC
Bug Depends On: 243490
Bug Blocks: 212651, 235704
Attachments:
  oprofile report
  First shot at hashing the context to sid conversion

Description Bill Nottingham 2006-09-05 04:10:37 UTC
Description of problem:

SELinux makes opening not-in-cache mailboxes abysmally slow.

Version-Release number of selected component (if applicable):

2.6.17-1.2617.fc6

How reproducible:

Every time.

Steps to Reproduce:
1. Reboot cleanly with SELinux into single-user mode
2. Run 'mutt' & immediately quit
3. Reboot cleanly with 'selinux=0' into single-user mode
4. Run 'mutt' & immediately quit
  
Actual results:

With SELinux:

real    1m8.423s
user    0m0.500s
sys     0m1.352s

Without SELinux:

real    0m5.906s
user    0m0.456s
sys     0m1.032s

That makes the SELinux case roughly 11.5x slower.

Expected results:

Roughly similar times. At best, 5%? 10% worse?

Additional info:

Mailbox is maildir format, 4084 messages.
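
The reboots in the steps above are only there to guarantee a cold cache; a rough
sketch of an equivalent run without rebooting (the mailbox path is illustrative)
would be:

   sync
   echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes (2.6.16+)
   time mutt -f ~/mail/other           # quit as soon as the index has loaded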

Comment 1 Dave Jones 2006-09-05 17:09:14 UTC
ugh, that is really bad.
Any chance you can get some oprofile data on this ?


Comment 2 Bill Nottingham 2006-09-05 17:42:57 UTC
Looking at it a different way, the mailbox in total is just short of 80MB, which
yields a throughput of 13.5MB/sec in the non-SELinux case and 1.1MB/sec in the
SELinux case - is something SELinux does pessimizing the I/O layer for small,
continuous requests?

Comment 3 James Morris 2006-09-05 17:57:05 UTC
It'd be interesting to see the output of 'avcstat 1' for the duration of the
test with SELinux enabled, as well as /selinux/avc/hash_stats before and after.

Also, please check the audit logs etc. in case it's generating a lot of avc
messages.
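
A minimal way to capture all of that in one go might look like this (the log file
names and mailbox path are just for illustration):

   cat /selinux/avc/hash_stats > /tmp/hash_stats.before
   avcstat 1 > /tmp/avcstat.log &
   AVCPID=$!
   time mutt -f ~/mail/other
   kill $AVCPID
   cat /selinux/avc/hash_stats > /tmp/hash_stats.after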

Comment 4 Bill Nottingham 2006-09-05 18:48:42 UTC
hash_stats before:

entries: 502
buckets used: 307/512
longest chain: 6

hash_stats after:

entries: 499
buckets used: 300/512
longest chain: 5

avcstat 1:

   lookups       hits     misses     allocs   reclaims      frees
    254774     253973        801        801        272        290
       121        121          0          0          0          0
         1          1          0          0          0          0
         1          1          0          0          0          0
         8          8          0          0          0          0
         1          1          0          0          0          0
         9          9          0          0          0          0
        57         57          0          0          0          0
        21         21          0          0          0          0
       674        671          3          3         16         16
      1821       1821          0          0          0          0
      1256       1256          0          0          0          0
      1242       1242          0          0          0          0
      1205       1205          0          0          0          0
      1200       1200          0          0          0          0
      1096       1096          0          0          0          0
      1091       1091          0          0          0          0
      1177       1177          0          0          0          0
      1211       1211          0          0          0          0
      1461       1461          0          0          0          0
      1095       1095          0          0          0          0
       840        840          0          0          0          0
   lookups       hits     misses     allocs   reclaims      frees
       302        302          0          0          0          0
      1160       1160          0          0          0          0
      1218       1218          0          0          0          0
       882        882          0          0          0          0
      1204       1204          0          0          0          0
      1160       1160          0          0          0          0
      1212       1212          0          0          0          0
      1172       1172          0          0          0          0
       919        919          0          0          0          0
      1038       1038          0          0          0          0
      1158       1158          0          0          0          0
      1178       1178          0          0          0          0
      1252       1252          0          0          0          0
       953        953          0          0          0          0
      1160       1160          0          0          0          0
      1331       1331          0          0          0          0
      1109       1109          0          0          0          0
      1192       1192          0          0          0          0
       978        978          0          0          0          0
      1136       1136          0          0          0          0
      1072       1072          0          0          0          0
      1304       1304          0          0          0          0
   lookups       hits     misses     allocs   reclaims      frees
      1168       1168          0          0          0          0
       818        818          0          0          0          0
      1107       1107          0          0          0          0
      1061       1061          0          0          0          0
      1240       1240          0          0          0          0
      1191       1191          0          0          0          0
       970        970          0          0          0          0
       883        883          0          0          0          0
      1092       1092          0          0          0          0
      1150       1150          0          0          0          0
       987        987          0          0          0          0
       795        795          0          0          0          0
       994        994          0          0          0          0
      1006       1006          0          0          0          0
      1165       1165          0          0          0          0

As for avc messages, none in dmesg (was testing in single user mode to avoid
noise in the data, so pre-auditd.)

Comment 5 James Morris 2006-09-05 19:03:37 UTC
These numbers all look normal.

Comment 6 Karl MacMillan 2006-09-05 19:12:17 UTC
Is the audit daemon enabled and did you check in /var/log/audit/audit.log for
avc messages?

Comment 7 Bill Nottingham 2006-09-05 19:20:35 UTC
See end of comment #4. :)

Comment 8 Bill Nottingham 2006-09-05 19:27:42 UTC
Created attachment 135598 [details]
oprofile report

Here's the oprofile report. check_poison_obj seems high.

Comment 9 James Morris 2006-09-06 04:33:24 UTC
What does strace -fc (and maybe ltrace) show with and without SELinux enabled?
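
Something along these lines should give comparable summaries for the two boots
(output file names are illustrative):

   strace -fc -o /tmp/strace.selinux mutt -f ~/mail/other
   # ...reboot with selinux=0, then:
   strace -fc -o /tmp/strace.noselinux mutt -f ~/mail/other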


Comment 10 Bill Nottingham 2006-09-06 19:55:13 UTC
With SELinux:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 82.98    0.601081         139      4312        11 stat64
 14.36    0.104035          25      4192           read
  1.99    0.014420           3      4205        13 open
  0.26    0.001860           2       830           write
  0.24    0.001766           9       194           getdents64
... (rest elided)

Without SELinux:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 43.31    0.035156           8      4192           read
 24.72    0.020063           5      4312        11 stat64
 15.75    0.012784          66       194           getdents64
 12.30    0.009982           2      4205        13 open
  2.63    0.002137           3       830           write
  0.38    0.000305           0      4111           munmap
  0.31    0.000255           0      4172           mmap2
  0.25    0.000204           0      8282           fstat64
  0.18    0.000145           0      4091           _llseek
  0.17    0.000142           0      4194           close
...

The obvious thing that stands out is the increased stat & read times, of
course.

Comment 11 Eric Paris 2006-10-20 17:31:11 UTC
Created attachment 139006 [details]
First shot at hashing the context to sid conversion

This patch reuses the data in the sidtab.  It actually creates a second htable,
called con_htable, and the nodes inside the table allow for a second list
(con_next instead of just next).

For testing purposes, /proc/sys/kernel/contab has been included.  If set to 0 it
should use the original search over all entries in the sidtab.  If set to 1 it
should instead try to search the parallel contab.  If results are positive, the
tunable will be dropped before the patch is submitted anywhere.
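
With the test kernel booted, exercising the tunable would look roughly like the
following; the mailbox path is illustrative and the tunable exists only in this
test patch:

   # cold-cache run with the original linear sidtab search
   echo 0 > /proc/sys/kernel/contab
   echo 3 > /proc/sys/vm/drop_caches
   time mutt -f ~/mail/other

   # cold-cache run using the parallel con_htable lookup
   echo 1 > /proc/sys/kernel/contab
   echo 3 > /proc/sys/vm/drop_caches
   time mutt -f ~/mail/other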

Comment 12 Stephen Smalley 2006-10-20 17:48:43 UTC
Watch out for convert_context() in services.c.  It remaps the context structure
values upon a policy reload, thereby changing the hash keys in your table.

Comment 13 Eric Paris 2006-10-20 19:09:37 UTC
So I've been looking at this for just a second.  I originally put a fix in
sidtab_map_remove_on_error that just removed every 'entry' from the contab and
then reinserted it after the apply.  Then I decided to change it and instead
remove on apply() failure or on the hash bucket changing.  But my question
becomes, is it worth it to conditionally remove the entry from the contab?  How
often do we remap contexts but not actually have any of the user/role/type
change?  Would that be so common on policy reload that I should care about
saving that little bit of time?  I might just go back to unconditionally
removing and reinserting only when needed; that makes the code shorter even if
it will be a waste of time when loading new policy...

As a side note, I figured I'd give you the stats I see.  Using a pretty up-to-date
Fedora system and policy, I'm getting the following after semanage -R:

sids:  445 entries and 128/128 buckets used, longest chain length 4
contab has 123/128 buckets used, longest chain length 24

So we do use almost all of the buckets, and our longest search went from 445 to
24.  I'd say that's a pretty good cut in worst-case performance.  Do you think it
would help my distribution at all if I added in something from the MLS field to
the hash key?  Do you think it even matters?  (oh yeah, and no fair having a
nice linearly distributed thing like a sid to use as the key)

Comment 14 Stephen Smalley 2006-10-20 19:39:29 UTC
- As load policy is not performance critical, always removing and re-inserting
the  entry is no big deal.  The user and role values will tend to remain stable,
but e.g. inserting a policy module might perturb the type value space.
- On the hash function, keep in mind that there will typically be a small number
of distinct SELinux users and roles but a large number of types.  The MLS levels
won't show up unless you are actually using multiple levels, which won't be the
case in a default install (until someone actually defines categories and uses
them for MCS, or in the MLS/LSPP case).  Or you could use jhash_nwords() from
include/linux/jhash.h and see how it performs.
- The real question of course is whether this makes any perceptible difference
in the performance overhead as described in the bug.


Comment 15 Bill Nottingham 2006-10-20 20:29:20 UTC
Test kernel:

real 1m14.009s
user 0m0.400s
sys 0m0.500s

Test kernel w/selinux=0:

real 0m5.240s
user 0m0.400s
sys 0m0.308s

In other words, no real improvement.

What additional IOs does SELinux generate on stat, open, or read?

Comment 16 Stephen Smalley 2006-10-20 20:53:00 UTC
When a dentry is first instantiated for the inode, SELinux has to invoke the
->getxattr method of the inode to obtain its security attribute, which is then
mapped to a SID and cached in the incore inode security struct.  Once that is
done, there should be no further additional IO; just a permission check, unless
of course you are generating audit messages due to denials.  Or if using
in-inode EAs (1), the getxattr would presumably not involve any additional IO at
all.

(1) http://marc.theaimsgroup.com/?l=git-commits-head&m=110583624019158&w=2
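
For what it's worth, the attribute in question can be inspected from userspace,
and whether the filesystem even has room to store it in the inode shows up in
tune2fs (the path and device below are illustrative; getfattr is in the attr
package):

   getfattr -n security.selinux ~/mail/other/cur/somemessage
   tune2fs -l /dev/mapper/vg-home | grep 'Inode size'
   # 128-byte inodes (the ext3 default at the time) leave no room for in-inode EAs;
   # 256-byte or larger inodes can hold a small EA such as security.selinux inline.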

Comment 17 Bill Nottingham 2006-10-20 21:01:37 UTC
Hm, ok. If the EA is in-inode, that *shouldn't* disrupt the IO pattern, I would
think. But something obviously is.

Comment 18 Bill Nottingham 2006-10-20 21:30:41 UTC
After talking with Eric Sandeen, it seems that EAs are *not* in-inode on ext3 by
default. So, this could explain the IO pattern changes. I'll attempt a test on a
system with bigger inodes to see if it makes a difference.
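
For reference, inode size is a mkfs-time choice; a scratch filesystem with room
for in-inode EAs could be created along these lines (the device is hypothetical):

   mke2fs -j -I 256 /dev/sdb1               # ext3 with 256-byte inodes
   tune2fs -l /dev/sdb1 | grep 'Inode size' # confirm what mkfs actually did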

Comment 19 Bill Nottingham 2006-10-23 19:37:45 UTC
A test of a system with in-inode EAs doesn't seem to show much difference, but
it is a somewhat disparate system (different processor, storage, etc.)

Comment 20 Bill Nottingham 2006-10-27 20:38:31 UTC
So, being unable to reliably reproduce this on fresh systems, I've investigated
the FS I'm seeing it on.

For all the mail files in question (one per message), I see FS information like:

debugfs: stat /home/notting/mail/other/cur/1040422194.11527_0.devserv.devel.redhat.com,U=1505:2,S
Inode: 5670539   Type: regular    Mode:  0600   Flags: 0x0   Generation: 3229277588
User:  2166   Group:  2167   Size: 3389
File ACL: 6223838    Directory ACL: 0

debugfs: stat /home/notting/mail/other/cur/1141659085.2391_0.devserv.devel.redhat.com,U=3743:2,RS
Inode: 5672771   Type: regular    Mode:  0600   Flags: 0x0   Generation: 3229279820
User:  2166   Group:  2167   Size: 2113
File ACL: 6225525    Directory ACL: 0

...

So, the ACLs are out of inode, plus they aren't particularly local to the
file in question. If my math (and understanding of the numbers) is to be
trusted, they're about 2GB apart on the disk, in general.

Therefore, for each file read/stat, the getxattr has to seek 2GB across the
disk, and then seek back for the next file. This would *have* to slow things
down tremendously.

A freshly created filesystem probably wouldn't have these locality issues, and
therefore probably wouldn't show this.

In any case, we may want to consider bumping the default inode size for ext3 so
that xattrs are in-inode, just to avoid the chance of machines degrading to this
state. CC'ing ext3 gurus.

(An interesting side effect of this is that, if I understand it right, the
longer you'd run a system like this without SELinux, the more likely the relabel
is to have to create xattrs in faraway disk blocks, making the problem worse...)

Comment 21 Stephen Tweedie 2006-10-27 21:00:03 UTC
EAs are not in inode, but there _should_ be sharing so that multiple related
files  with the same EAs get the same storage.  Is that mechanism failing here
such that each inode is getting its own EA block, perhaps?

Comment 22 Bill Nottingham 2006-10-27 21:08:44 UTC
(debugfs log is debugfs stat output for every file)

[notting@nostromo: /tmp]$ grep "File ACL" debugfslog | wc -l
4188
[notting@nostromo: /tmp]$ grep "File ACL" debugfslog | sort -u | wc -l
4087

Doesn't appear to be much sharing.
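
(For completeness, a log like that can be generated with a debugfs loop along
these lines; the device name is illustrative and the paths are given relative to
the filesystem root:)

   for f in /home/notting/mail/other/cur/*; do
       debugfs -R "stat $f" /dev/mapper/vg-home
   done > /tmp/debugfslog 2>/dev/null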

Comment 23 Karl MacMillan 2006-10-27 21:36:16 UTC
Couple of questions:

1) The selinux xattrs are different from any ACLs on the file - are you certain
that those debug messages are giving information about the selinux labels?

2) When considering larger default values, we need to consider all xattrs, not
just selinux contexts, correct? That includes beagle and other heavy xattr users.

As for the selinux component, most contexts are not that long, though of course
there is no real restriction. A long example would be:

system_u:object_r:httpd_unconfined_script_exec_t:s0-s0:c0,c12,c15

Only ~70 characters.

Comment 24 Eric Sandeen 2006-10-27 21:43:25 UTC
Hmm, one other twist here: ext3 -should- be sharing a single acl filesystem
block for up to 1024 files that share the same attribute.  This worked in a very
simple non-selinux attr test I did... but in notting's case it's not.  Proper
sharing could well resolve the performance impact, w/o needing to change the
filesystem geometry.  I will look into this...
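
(Roughly, that simple test, assuming a scratch ext3 mount at /mnt/test; setfattr
is in the attr package:)

   mkdir -p /mnt/test/dir
   for i in $(seq 1 200); do
       touch /mnt/test/dir/f$i
       setfattr -n user.test -v samevalue /mnt/test/dir/f$i
   done
   # then count unique "File ACL" blocks with debugfs, as in comment 22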

Comment 25 Bill Nottingham 2006-10-27 21:46:58 UTC
1) How would I tell? I haven't set any ACLs up by hand, if that's what you mean.

Comment 26 Eric Sandeen 2006-10-28 00:39:32 UTC
were the stats in comment #22 for a complete filesystem, or a directory within a
larger filesystem?

Comment 27 Bill Nottingham 2006-10-28 03:02:47 UTC
A single directory (holding the mail in question that's the test case here.)

Comment 28 Eric Sandeen 2006-10-28 03:10:17 UTC
Hm, so it's possible that there is good sharing, but sharing across the
filesystem and not in a single directory; I'll have to look at how the
hashing/sharing algorithm works, and how the relabeler works... I'd expect it to
do things in roughly directory order, and wind up sharing blocks that same way...

Comment 29 Eric Sandeen 2006-10-30 18:32:29 UTC
re: comment #23, the block-sharing tests Bill did should be accurate; although
the selinux xattrs are a different namespace, they are stored in the same way as
other xattrs, in the same on-disk locations.

I did a test on a freshly installed RHEL5 snapshot, and of the first 32k or so
system files I checked, I found 147 unique policies, and 208 unique acl blocks.
 So, the sharing is working quite well in this case.

What I'm thinking is, for a maildir, the new files are trickling in relatively
slowly over time, and there is no cached matching attribute found, so we keep
using new blocks.  Maybe ext3 could be extended to take a quick look for other
files in the directory which may match the new attribute, rather than relying on
"hopefully" finding one cached...?  I need to look at how ext3 handles this a
bit more.

I've asked Bill to try copying his maildir to a fresh dir and pointing mutt at it;
presumably all the files will share an attr block in that case, and if it's
fast, then we know that the block sharing is the issue.

Bill was talking about attr blocks 2G away from the inode, but I think he may
have been looking at the wrong output when determining the inode location, I'll
work w/ him on this to find out for sure.
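
(The suggested copy test, roughly; the paths are illustrative:)

   cp -a ~/mail/other ~/mail/other.copy   # freshly created files get their xattr blocks written in one burst
   echo 3 > /proc/sys/vm/drop_caches      # make sure the timing is cold-cache again
   time mutt -f ~/mail/other.copy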

Comment 30 Bill Nottingham 2006-10-30 18:58:48 UTC
So, upon further investigation, beagle (due to storing xattrs for the mtime, atime,
and uid of the file) makes xattr block sharing impossible.
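
(That is visible by dumping every xattr on a couple of neighbouring messages;
the paths are illustrative and getfattr comes from the attr package:)

   getfattr -d -m - ~/mail/other/cur/msg1 ~/mail/other/cur/msg2
   # the per-file beagle attributes differ between the two files, so no two EA
   # blocks are identical and ext3 cannot share them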

Comment 31 Bill Nottingham 2006-11-02 16:37:37 UTC
From the RHEL5 version of the bug, Eric Sandeen said:

We should probably turn on larger inodes in FC7...

Should we do this for FC7 by changing the kernel default, the mke2fs default, or
something else?

Comment 32 Eric Sandeen 2006-11-02 16:57:37 UTC
It's a mkfs-time thing.  Apparently upstream e2fsprogs may change the default
soon, at least for ext4 filesystems.

So simplest is to change it in anaconda.  I can bring up the idea of changing
the e2fsprogs default, although since larger inodes won't work w/ older kernels,
that may not be the best plan.

Comment 33 Bill Nottingham 2006-11-02 17:17:53 UTC
How old a kernel are we talking about?

Comment 34 Eric Paris 2007-01-08 19:34:39 UTC
esandeen, so what are we thinking of doing with this bug for FC7?  How old would
a kernel have to be to not understand the filesystem if we made the inodes large
enough to hold the xattrs?

Comment 35 Eric Sandeen 2007-01-08 20:01:52 UTC
Sorry, missed Bill's earlier comment...

Looks like ea-in-inode support went in around 2.6.11.

If we expect users to be using lots of acls or attrs, then I think it would make
sense to enable larger inodes & in-inode attr storage for FC7.

-Eric

Comment 36 Eric Sandeen 2007-08-01 22:22:44 UTC
Or maybe enable it for FC8.... :)

Comment 37 Bill Nottingham 2007-08-02 14:44:19 UTC
See bug 243490 - it's not that simple.

Comment 38 Eric Sandeen 2007-08-13 18:08:15 UTC
Rawhide grub can now grok larger inodes.

Doing this might be good for making the transition to ext4 simpler down the
line, as well.  (ext4 also wants larger inodes, to store extent data).

Should I open a new RFE for larger ext3 inodes in F8...?

-Eric

Comment 39 Bill Nottingham 2007-08-29 16:49:11 UTC
Except we just missed the feature freeze... sure! Might want to file it to
see if we can slip it in.

Comment 40 Bug Zapper 2008-04-03 18:08:25 UTC
Based on the date this bug was created, it appears to have been reported
against rawhide during the development of a Fedora release that is no
longer maintained. In order to refocus our efforts as a project we are
flagging all of the open bugs for releases which are no longer
maintained. If this bug remains in NEEDINFO thirty (30) days from now,
we will automatically close it.

If you can reproduce this bug in a maintained Fedora version (7, 8, or
rawhide), please change this bug to the respective version and change
the status to ASSIGNED. (If you're unable to change the bug's version
or status, add a comment to the bug and someone will change it for you.)

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

Comment 41 Kostas Georgiou 2008-04-03 20:14:55 UTC
Have the defaults been changed for F9? If yes then the bug can be closed I guess. 

Comment 42 Eric Sandeen 2008-04-03 20:54:33 UTC
F9 now has 256-byte ext3 inodes by default, to allow inline attributes.  I would
expect that this will resolve it, but it'd be worth re-testing the original case.
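
(A quick sanity check on the retest box before timing mutt again; the device name
is illustrative:)

   tune2fs -l /dev/mapper/VolGroup00-LogVol00 | grep 'Inode size'
   # should report 256 on a filesystem created by the F9 installer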

Comment 43 Bug Zapper 2008-05-14 02:19:29 UTC
Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 44 Dave Jones 2008-06-04 19:28:46 UTC
Bill, can you retest? Do we still suck?

Comment 45 Bill Nottingham 2008-06-04 19:31:25 UTC
I am currently unable to reproduce the situation that would cause this, no.

Comment 46 Dave Jones 2008-06-04 19:40:12 UTC
awesome, we don't suck!