Bug 205161
Summary:            11.5x performance degradation with SELinux on ext3
Product:            Fedora
Component:          kernel
Version:            9
Hardware:           All
OS:                 Linux
Status:             CLOSED CURRENTRELEASE
Severity:           medium
Priority:           medium
Reporter:           Bill Nottingham <notting>
Assignee:           Eric Paris <eparis>
QA Contact:         Brian Brock <bbrock>
CC:                 eparis, esandeen, jmorris, kernel-maint, k.georgiou, rvokal, sct, sdsmall, sgrubb, tmraz, triage, wtogami
Whiteboard:         bzcl34nup
Fixed In Version:   F9
Doc Type:           Bug Fix
Last Closed:        2008-06-04 19:40:12 UTC
Bug Depends On:     243490
Bug Blocks:         212651, 235704
Attachments:        135598 (oprofile report), 139006 (context-to-SID hashing patch)
Description
Bill Nottingham
2006-09-05 04:10:37 UTC
ugh, that is really bad. Any chance you can get some oprofile data on this?

Looking at it a different way, the mailbox in total is just short of 80MB, which yields a throughput of 13.5MB/sec in the non-SELinux case and 1.1MB/sec in the SELinux case. Is something SELinux does leaving the I/O layer heavily pessimized for small, continuous requests?

It'd be interesting to see the output of 'avcstat 1' for the duration of the test with SELinux enabled, as well as /selinux/avc/hash_stats before and after. Also, please check the audit logs etc. in case it's generating a lot of avc messages.

hash_stats before:
entries: 502
buckets used: 307/512
longest chain: 6

hash_stats after:
entries: 499
buckets used: 300/512
longest chain: 5

avcstat 1:
lookups hits misses allocs reclaims frees
254774 253973 801 801 272 290
121 121 0 0 0 0
1 1 0 0 0 0
1 1 0 0 0 0
8 8 0 0 0 0
1 1 0 0 0 0
9 9 0 0 0 0
57 57 0 0 0 0
21 21 0 0 0 0
674 671 3 3 16 16
1821 1821 0 0 0 0
1256 1256 0 0 0 0
1242 1242 0 0 0 0
1205 1205 0 0 0 0
1200 1200 0 0 0 0
1096 1096 0 0 0 0
1091 1091 0 0 0 0
1177 1177 0 0 0 0
1211 1211 0 0 0 0
1461 1461 0 0 0 0
1095 1095 0 0 0 0
840 840 0 0 0 0
lookups hits misses allocs reclaims frees
302 302 0 0 0 0
1160 1160 0 0 0 0
1218 1218 0 0 0 0
882 882 0 0 0 0
1204 1204 0 0 0 0
1160 1160 0 0 0 0
1212 1212 0 0 0 0
1172 1172 0 0 0 0
919 919 0 0 0 0
1038 1038 0 0 0 0
1158 1158 0 0 0 0
1178 1178 0 0 0 0
1252 1252 0 0 0 0
953 953 0 0 0 0
1160 1160 0 0 0 0
1331 1331 0 0 0 0
1109 1109 0 0 0 0
1192 1192 0 0 0 0
978 978 0 0 0 0
1136 1136 0 0 0 0
1072 1072 0 0 0 0
1304 1304 0 0 0 0
lookups hits misses allocs reclaims frees
1168 1168 0 0 0 0
818 818 0 0 0 0
1107 1107 0 0 0 0
1061 1061 0 0 0 0
1240 1240 0 0 0 0
1191 1191 0 0 0 0
970 970 0 0 0 0
883 883 0 0 0 0
1092 1092 0 0 0 0
1150 1150 0 0 0 0
987 987 0 0 0 0
795 795 0 0 0 0
994 994 0 0 0 0
1006 1006 0 0 0 0
1165 1165 0 0 0 0

As for avc messages, none in dmesg (I was testing in single-user mode to avoid noise in the data, so pre-auditd).

These numbers all look normal. Is the audit daemon enabled, and did you check /var/log/audit/audit.log for avc messages?

See end of comment #4. :)

Created attachment 135598 [details]
oprofile report
Here's the oprofile report. check_poison_obj seems high.
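For what it's worth, check_poison_obj() is part of the slab poisoning checks compiled in with CONFIG_DEBUG_SLAB, so it sitting high in the profile most likely reflects a debug-enabled test kernel rather than SELinux itself. A quick way to check, assuming the packaged config file under /boot that Fedora kernels install:

    # shows CONFIG_DEBUG_SLAB=y if the running kernel has slab debugging compiled in
    grep CONFIG_DEBUG_SLAB /boot/config-$(uname -r)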
What does strace -fc (and maybe ltrace) show with and without SELinux enabled?

With SELinux:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 82.98    0.601081         139      4312        11 stat64
 14.36    0.104035          25      4192           read
  1.99    0.014420           3      4205        13 open
  0.26    0.001860           2       830           write
  0.24    0.001766           9       194           getdents64
... (rest elided)

Without SELinux:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 43.31    0.035156           8      4192           read
 24.72    0.020063           5      4312        11 stat64
 15.75    0.012784          66       194           getdents64
 12.30    0.009982           2      4205        13 open
  2.63    0.002137           3       830           write
  0.38    0.000305           0      4111           munmap
  0.31    0.000255           0      4172           mmap2
  0.25    0.000204           0      8282           fstat64
  0.18    0.000145           0      4091           _llseek
  0.17    0.000142           0      4194           close
...

The obvious thing that stands out is the increased stat & read times, of course.

Created attachment 139006 [details]
First shot at hashing the context to sid conversion
This patch reuses the data in the sidtab. It creates a second htable, called
con_htable, and the nodes inside the table carry a second list pointer
(con_next, in addition to next).

For testing purposes, /proc/sys/kernel/contab has been included. If set to 0,
it uses the original search over all entries in the sidtab. If set to 1, it
instead searches the parallel contab. If the results are positive, the
tunable will be dropped before the patch is submitted anywhere.
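A minimal sketch of how the two lookup paths could be compared with this test patch in place; the /proc/sys/kernel/contab tunable is the one described above, /selinux/avc/hash_stats is the file requested earlier in the thread, and run-maildir-test.sh is only a hypothetical stand-in for whatever mail-reading workload showed the slowdown:

    # 0 = original linear search over the sidtab, 1 = lookup via the parallel con_htable
    echo 0 > /proc/sys/kernel/contab
    time ./run-maildir-test.sh          # hypothetical wrapper around the mutt/maildir run

    echo 1 > /proc/sys/kernel/contab
    time ./run-maildir-test.sh

    # AVC cache statistics for the SELinux-enabled runs
    cat /selinux/avc/hash_stats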
Watch out for convert_context() in services.c. It remaps the context structure values upon a policy reload, thereby changing the hash keys in your table.

So I've been looking at this for just a second. I originally put a fix in sidtab_map_remove_on_error that just removed every entry from the contab and then reinserted it after the apply. Then I decided to change it and instead remove on apply() failure or when the hash bucket changes. But my question becomes: is it worth it to conditionally remove the entry from the contab? How often do we remap contexts without any of the user/role/type actually changing? Would that be so common on policy reload that I should care about saving that little bit of time? I might just go back to unconditionally removing, and reinserting only when needed; it makes the code shorter even if it wastes a little time when loading new policy...

As a side note, I figured I'd give you the stats I see. Using a pretty up-to-date Fedora system and policy, I'm getting the following after semanage -R:

sids:   445 entries, 128/128 buckets used, longest chain length 4
contab: 123/128 buckets used, longest chain length 24

So we do use almost all of the buckets, and our longest search went from 445 to 24. I'd say that's a pretty good cut in worst-case performance. Do you think it would help my distribution at all if I added in something from the MLS field to the hash key? Do you think it even matters? (Oh yeah, and no fair having a nice linearly distributed thing like a SID to use as the key.)

- As load policy is not performance critical, always removing and re-inserting the entry is no big deal. The user and role values will tend to remain stable, but e.g. inserting a policy module might perturb the type value space.
- On the hash function, keep in mind that there will typically be a small number of distinct SELinux users and roles but a large number of types. The MLS levels won't show up unless you are actually using multiple levels, which won't be the case in a default install (until someone actually defines categories and uses them for MCS, or in the MLS/LSPP case). Or you could use jhash_nwords() from include/linux/jhash.h and see how it performs.
- The real question of course is whether this makes any perceptible difference in the performance overhead as described in the bug.

Test kernel:
real    1m14.009s
user    0m0.400s
sys     0m0.500s

Test kernel w/ selinux=0:
real    0m5.240s
user    0m0.400s
sys     0m0.308s

In other words, no real improvement.

What additional IOs does SELinux generate on stat, open, or read?

When a dentry is first instantiated for the inode, SELinux has to invoke the ->getxattr method of the inode to obtain its security attribute, which is then mapped to a SID and cached in the in-core inode security struct. Once that is done, there should be no further additional IO, just a permission check, unless of course you are generating audit messages due to denials. Or, if using in-inode EAs (1), the getxattr would presumably not involve any additional IO at all.

(1) http://marc.theaimsgroup.com/?l=git-commits-head&m=110583624019158&w=2

Hm, ok. If the EA is in-inode, that *shouldn't* disrupt the IO pattern, I would think. But something obviously is.

After talking with Eric Sandeen, it seems that EAs are *not* in-inode on ext3 by default. So, this could explain the IO pattern changes. I'll attempt a test on a system with bigger inodes to see if it makes a difference.
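For reference, whether an ext3 EA can be stored in the inode is decided by the inode size chosen at mkfs time: the old 128-byte default leaves no room, while 256-byte inodes do. A rough way to check the filesystem under test, and to build a scratch filesystem for the bigger-inode experiment (device names are only examples):

    # inode size of the existing filesystem; 128 means every xattr needs an external block
    dumpe2fs -h /dev/sda2 | grep 'Inode size'

    # scratch ext3 filesystem with 256-byte inodes, leaving room for in-inode xattrs
    mke2fs -j -I 256 /dev/sdb1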
A test of a system with in-inode EAs doesn't seem to show much difference, but it is a somewhat disparate system (different processor, storage, etc.). So, being unable to reliably reproduce this on fresh systems, I've investigated the FS I'm seeing it on. For all the mail files in question (one per message), I see FS information like:

debugfs: stat /home/notting/mail/other/cur/1040422194.11527_0.devserv.devel.redhat.com,U=1505:2,S
Inode: 5670539   Type: regular   Mode: 0600   Flags: 0x0   Generation: 3229277588
User: 2166   Group: 2167   Size: 3389
File ACL: 6223838   Directory ACL: 0

debugfs: stat /home/notting/mail/other/cur/1141659085.2391_0.devserv.devel.redhat.com,U=3743:2,RS
Inode: 5672771   Type: regular   Mode: 0600   Flags: 0x0   Generation: 3229279820
User: 2166   Group: 2167   Size: 2113
File ACL: 6225525   Directory ACL: 0

...

So, the ACLs are out of inode, plus they aren't particularly local to the file in question. If my math (and understanding of the numbers) is to be trusted, they're about 2GB apart on the disk, in general. Therefore, for each file read/stat, the getxattr has to seek 2GB across the disk, and then seek back for the next file. This would *have* to slow things down tremendously. A freshly created filesystem probably wouldn't have these locality issues, and therefore probably wouldn't show this.

In any case, we may want to consider bumping the default inode size for ext3 so that xattrs are in-inode, just to avoid the chance of machines degrading to this state. CC'ing ext3 gurus.

(An interesting side effect of this is that, if I understand it right, the longer you'd run a system like this without SELinux, the more likely the relabel is to have to create xattrs in faraway disk blocks, making the problem worse...)

EAs are not in inode, but there _should_ be sharing so that multiple related files with the same EAs get the same storage. Is that mechanism failing here such that each inode is getting its own EA block, perhaps?

(debugfslog is debugfs stat output for every file)

[notting@nostromo: /tmp]$ grep "File ACL" debugfslog | wc -l
4188
[notting@nostromo: /tmp]$ grep "File ACL" debugfslog | sort -u | wc -l
4087

Doesn't appear to be much sharing.

Couple of questions:
1) The selinux xattrs are different from any ACLs on the file - are you certain that those debug messages are giving information about the selinux labels?
2) When considering a larger default value, we need to consider all xattrs, not just selinux contexts, correct? That includes beagle and other heavy xattr users.

As for the selinux component, most contexts are not that long, though of course there is no real restriction. A long example would be:
system_u:object_r:httpd_unconfined_script_exec_t:s0-s0:c0,c12,c15
Only ~70 characters.

Hmm, one other twist here: ext3 -should- be sharing a single acl filesystem block for up to 1024 files that share the same attribute. This worked in a very simple non-selinux attr test I did... but in notting's case it's not. Proper sharing could well resolve the performance impact, w/o needing to change the filesystem geometry. I will look into this...

1) How would I tell? I haven't set any ACLs up by hand, if that's what you mean.

Were the stats in comment #22 for a complete filesystem, or a directory within a larger filesystem?

A single directory (holding the mail in question that's the test case here).
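Along the same lines as the grep over debugfslog above, one way to quantify the sharing is to pull the EA ("File ACL") block number for every message file out of debugfs and count the distinct values; with working sharing this should be far smaller than the file count. The device name is a placeholder, and the full /home/... paths work here only because, as in the stat output above, that filesystem's root contains /home:

    # count distinct xattr blocks referenced by the maildir
    for f in /home/notting/mail/other/cur/*; do
        debugfs -R "stat $f" /dev/sda2 2>/dev/null
    done | grep 'File ACL' | awk '{print $3}' | sort -u | wc -l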
Hm, so it's possible that there is good sharing, but sharing across the filesystem and not in a single directory; I'll have to look at how the hashing/sharing algorithm works, and how the relabeler works... I'd expect it to do things in roughly directory order, and wind up sharing blocks that same way...

Re: comment #23, the block-sharing tests Bill did should be accurate; although the selinux xattrs are a different namespace, they are stored in the same way as other xattrs, in the same on-disk locations.

I did a test on a freshly installed RHEL5 snapshot, and of the first 32k or so system files I checked, I found 147 unique policies and 208 unique acl blocks. So, the sharing is working quite well in this case. What I'm thinking is, for a maildir, the new files are trickling in relatively slowly over time, and there is no cached matching attribute found, so we keep using new blocks. Maybe ext3 could be extended to take a quick look for other files in the directory which may match the new attribute, rather than relying on "hopefully" finding one cached...? I need to look at how ext3 handles this a bit more.

I've asked Bill to try copying his maildir to a fresh dir and pointing mutt at it; presumably all the files will share an attr block in that case, and if it's fast, then we know that block sharing is the issue. Bill was talking about attr blocks 2G away from the inode, but I think he may have been looking at the wrong output when determining the inode location; I'll work w/ him on this to find out for sure.

So, upon further investigation, beagle (due to storing xattrs for mtime, atime, and a uid of the file) makes xattr block sharing impossible. From the RHEL5 version of the bug, Eric Sandeen said: "We should probably turn on larger inodes in FC7..."

Should we do this for FC7 by changing the kernel default, the mke2fs default, or something else?

It's a mkfs-time thing. Apparently upstream e2fsprogs may change the default soon, at least for ext4 filesystems. So the simplest is to change it in anaconda. I can bring up the idea of changing the e2fsprogs default, although since larger inodes won't work w/ older kernels, that may not be the best plan.

How much older a kernel?

esandeen: so what are we thinking of doing with this bug for FC7?

How much older a kernel wouldn't understand it if we made the inodes large enough to hold the xattrs?

Sorry, missed Bill's earlier comment... Looks like ea-in-inode support went in around 2.6.11. If we expect users to be using lots of acls or attrs, then I think it would make sense to enable larger inodes & in-inode attr storage for FC7. -Eric

Or maybe enable it for FC8.... :)

See bug 243490 - it's not that simple.

Rawhide grub can now grok larger inodes. Doing this might be good for making the transition to ext4 simpler down the line, as well (ext4 also wants larger inodes, to store extent data). Should I open a new RFE for larger ext3 inodes in F8...? -Eric

Except we just missed the feature freeze... sure! Might want to file it to see if we can slip it in.

Based on the date this bug was created, it appears to have been reported against rawhide during the development of a Fedora release that is no longer maintained. In order to refocus our efforts as a project we are flagging all of the open bugs for releases which are no longer maintained. If this bug remains in NEEDINFO thirty (30) days from now, we will automatically close it.
If you can reproduce this bug in a maintained Fedora version (7, 8, or rawhide), please change this bug to the respective version and change the status to ASSIGNED. (If you're unable to change the bug's version or status, add a comment to the bug and someone will change it for you.) Thanks for your help, and we apologize again that we haven't handled these issues to this point.

The process we're following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp
We will be following the process here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this doesn't happen again.

Have the defaults been changed for F9? If yes, then the bug can be closed, I guess.

F9 now has 256-byte ext3 inodes by default, to allow inline attributes. I would expect that this will resolve it, but it'd be worth re-testing the original case.

Changing version to '9' as part of upcoming Fedora 9 GA. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Bill, can you retest? Do we still suck?

I am currently unable to reproduce the situation that would cause this, no.

awesome, we don't suck!
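For anyone re-running the original test on F9, a hedged way to confirm that the new defaults actually apply: only filesystems created with the newer mke2fs get the 256-byte inodes, so a /home made on an older release keeps its 128-byte inodes (and external EA blocks) until it is recreated. Device and file names below are examples only, and the debugfs path assumes the same layout as earlier in the thread:

    # 256-byte inodes are what give security.selinux room to live inside the inode
    dumpe2fs -h /dev/sda2 | grep 'Inode size'

    # a freshly labeled file should then show no external EA block and a normal label
    debugfs -R 'stat /home/notting/mail/other/cur/somefile' /dev/sda2 | grep 'File ACL'
    getfattr -n security.selinux /home/notting/mail/other/cur/somefile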