Bug 1075132
| Summary: | Excessive LDAP calls by ipa-sam during file operations to samba file share on freeipa master cause high CPU and slow performance | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Dmitri Pal <dpal> |
| Component: | ipa | Assignee: | Martin Kosek <mkosek> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Namita Soman <nsoman> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 7.0 | CC: | abokovoy, asn, devel, mkosek, rcritten, spoore |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | ipa-3.3.3-22.el7 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | 1074314 | Environment: | |
| Last Closed: | 2014-06-13 13:04:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1074314 | ||
| Bug Blocks: | 1073810 | ||
|
Comment 2
Dmitri Pal
2014-03-11 20:07:23 UTC
Fixed upstream master: https://fedorahosted.org/freeipa/changeset/d6a7923f71eb69bac53d6ff904086a9abd103dbc ipa-3-3: https://fedorahosted.org/freeipa/changeset/13cd4faf551d7781d27c36bef0e7cbf515e072d2 Need to get confirmation that this is a valid verification test: ON VERSION ipa-server-3.3.3-19.el7.x86_64: /[root@master freeipa]# rm -rf /bz1075132/* [root@master freeipa]# net cache flush [root@master freeipa]# net conf setparm global 'log level' 10 [root@master freeipa]# for i in /var/log/samba/log.*; do > $i; done [root@master freeipa]# cd /source/freeipa [root@master freeipa]# time echo -e 'recurse\nprompt n\nmput *'|\ > smbclient -k //master.ipa1.example.test/share > /tmp/smbclient.out 2>&1 real 0m55.811s user 0m0.493s sys 0m0.665s [root@master freeipa]# grep ldap_search /var/log/samba/* 2>/dev/null | wc -l 9778 Large number of ldap_search entries in logs. Now, I upgrade ipa-server to newer/fixed version: ON VERSION ipa-server-3.3.3-25.el7.x86_64: [root@master freeipa]# rm -rf /bz1075132/* [root@master freeipa]# net cache flush Couldn't delete entry! key = IDMAP/SID2XID/S-1-5-2 Couldn't delete entry! key = IDMAP/SID2XID/S-1-5-32-545 Couldn't delete entry! key = IDMAP/SID2XID/S-1-1-0 [root@master freeipa]# net conf setparm global 'log level' 10 [root@master freeipa]# for i in /var/log/samba/log.*; do > $i; done [root@master freeipa]# cd /source/freeipa [root@master freeipa]# time echo -e 'recurse\nprompt n\nmput *'|\ > smbclient -k //master.ipa1.example.test/share > /tmp/smbclient.out 2>&1 real 0m50.562s user 0m0.508s sys 0m0.676s [root@master freeipa]# grep ldap_search /var/log/samba/* 2>/dev/null | wc -l 98 About 1% of the number of ldap_search entries found. Running time not much different but, I'm running it on a limited VM. Hi Scott, CPU was very high maybe that's easier to measure? Or not heh. For me the slow performance was as a direct result of the CPU hitting the roof. Maybe iostat while running test? For me after the patch the CPU never went above 5% where before it was mostly 80-100% on a single core. Jason Jason, How large a sample for file transfer did you use? Did the sample included files owned by many different users? CPU didn't spike when I tried this on the 3.3.3-19 version. My sample for the transfer was a clone of the freeipa git repo. That was only 60MB though I think and I only had one non-root user own it with a different group. Maybe that's my problem? Too small a sample and too few users/groups to lookup? Thanks, Scott Hi Scott, Maybe 3.3.3 proved slapd? I was on 3.0.0-37 EL6_4 and slapd was using lots of CPU processing the LDAP queries. I was using Mac OS X 10.9 client copying 200+MB of various git repo data. Slapd on server would spike and be constantly high CPU 80-100. To be fair I guess it can be said if LDAP queries are 1% of previous it's enough to justify some (even if minimal) improvement to performance. Thanks for all the help integrating the patch! Jason "...3.3.3 improved** slapd" I asked Alexander about reproduction of this bug last week and he advised running: # time for i in `seq 0 150` ; do echo "lookupnames admin" ; done | rpcclient -k `hostname` >/dev/null The shoud be a difference (roughly 3-4 times more) in time spent in ns-slapd and sssd in the older case. As for your current reproduction scenario, I will let Alexander to decide. It could also be noted that I was connected as a user which was a member of 2 groups - each group a member of another group - so 4 groups. user - group1,group2,group1_parent,group2_parent Scott,
I think what you can use is my performance testing script.
# cat testperformance.sh
#!/bin/bash
REALM=MY-REALM
net cache flush
net conf setparm global 'log level' 10
for i in /var/log/dirsrv/slapd-$REALM/access /var/log/samba/log.*; do > $i; done
USERCRED="admin%Test1234"
DOMSID=`net getdomainsid`
COUNTER=0
while [ $COUNTER -lt 80 ]; do
echo "lookupsid ${DOMSID}-$((1000 + COUNTER))"
echo "lookupnames user${COUNTER} users${COUNTER}"
let COUNTER=COUNTER+1
done | rpcclient -U${USERCRED} `hostname` >/dev/null
net conf setparm global 'log level' 1
echo -n "Samba ldap_search operations:"
grep ldap_search /var/log/samba/* 2>/dev/null | wc -l
# Sleep a bit because DS pushes logs with a delay
sleep 5
echo -n "DS ldap_search operations:"
grep SRCH /var/log/dirsrv/slapd-*/access 2>/dev/null | wc -l
# kinit admin
....
# for i in `seq 1 30` ; do ipa user-add --first="Test" --last="User ${i}" user${i} ; done
....
# for i in `seq 1 30` ; do ipa group-add --desc="Test Group ${i}" users${i} ; done
# time sh testperformance.sh
With this test you would have 30 users and 30 groups and ask rpcclient to lookup SIDs for groups and users.
Again, I don't see a big difference in performance but, I do see the big search count difference: Test with version ipa-server-3.3.3-19.el7.x86_64: [root@master ~]# time sh testperformance.sh Samba ldap_search operations:800 DS ldap_search operations:0 real 0m9.411s user 0m0.360s sys 0m0.180s Test with version ipa-server-3.3.3-25.el7.x86_64: [root@master ~]# time sh testperformance.sh Couldn't delete entry! key = IDMAP/SID2XID/S-1-5-2 Couldn't delete entry! key = IDMAP/SID2XID/S-1-5-32-545 Couldn't delete entry! key = IDMAP/SID2XID/S-1-1-0 Samba ldap_search operations:8 DS ldap_search operations:0 real 0m8.965s user 0m0.342s sys 0m0.192s It should be noted that the above test for -19 release was done after an rpm downgrade from -25. I'll reload my VM and start from scratch to get a clean run to see if anything differs. Thanks, Scott Ok. Make sure you adjust 'sleep 5' to some usable value so that DS has chance to flush its logs to the file. It depends on your VM, in general it may tak around 5-10 seconds before you'll get right data in the access log. Ok, with the rebuild I'm seeing different results. I don't know if my previous results were maybe due to other configurations and/or uninstall/re-install executions. Now, I'm seeing similar results regardless. What exactly should I see with this test on RHEL7? faster response times? fewer log entries? right? ipa-server-3.3.3-19.el7.x86_64: # time sh testperformance.sh Samba ldap_search operations:16 DS ldap_search operations:979 real 0m38.769s user 0m0.350s sys 0m0.281s ipa-server-3.3.3-22.el7.x86_64: # time sh testperformance.sh Samba ldap_search operations:14 DS ldap_search operations:1002 real 0m39.079s user 0m0.287s sys 0m0.270s Note above I have set the sleep delay to 30 seconds to try to make sure I catch anything I might have missed. Now, that being said, I re-ran the smbclient tests and those still do look the same. Some kind of caching going on with the other test? Or is that test just a generic perf test and not specific to this bug? ipa-server-3.3.3-19.el7.x86_64: # time echo -e 'recurse\nprompt n\nmput *'|\ > smbclient -k //$(hostname)/share > /tmp/smbclient.out 2>&1 real 0m26.538s user 0m0.351s sys 0m0.509s # grep ldap_search /var/log/samba/* 2>/dev/null | wc -l 10912 # grep SRCH /var/log/dirsrv/slapd-*/access 2>/dev/null | wc -l 4568 ipa-server-3.3.3-22.el7.x86_64: # time echo -e 'recurse\nprompt n\nmput *'|\ > smbclient -k //$(hostname)/share > /tmp/smbclient.out 2>&1 real 0m21.921s user 0m0.341s sys 0m0.542s # grep ldap_search /var/log/samba/* 2>/dev/null | wc -l 82 # grep SRCH /var/log/dirsrv/slapd-*/access 2>/dev/null | wc -l 51 And yes, I waited between the greps. So, What am I seeing above with the rpcclient tests showing about the same results regardless of patch? Thanks, Scott Ok, so what we are seeing in rpcclient tests is probably an influence of the caching in the Samba layers. When copying files there are a lot of calls that attempt to set and verify permissions and calculate access masks, etc. They lead to fetching domain SID and other operations which are not really about name2sid/sid2name/sid2id operations so having multiple requests towards the identity store is expected. However, since your file store is largerly fixed, there are probably not many identities who own the files and thus, less to resolve. Samba caches most of these requests and adding caching of idmap in ipasam helps to reduce number of repetitive calls in access masks and setting permissions, I guess. After all, you'd need to perform it for every file that is copied which multiplies the number of idmap requests. If the rpcclient tests are being affected by some samba caching, can we use those to verify this bug or its fix? If smbclient tests are making calls that are irrelevant here, why is there such a drastic difference in count between the two system running the different versions? Thanks, Scott After some review by dev, it was determined that we cannot at this time use rpcclient for the test. "with the patch POSIX ID related queries are cached and rpclient does not know about POSIX IDs" So, I'm doing the verification with smbclient. First, the reproduced results:
[root@cloud-qe-18 ~]# # add user/group
[root@cloud-qe-18 ~]# service sssd stop; rm -rf /var/lib/sss/{db,mc}/*; service sssd start
Redirecting to /bin/systemctl stop sssd.service
Redirecting to /bin/systemctl start sssd.service
[root@cloud-qe-18 ~]# kdestroy -A ; echo Secret123 | kinit admin
Password for admin:
[root@cloud-qe-18 ~]# ipa user-del bz1075132
ipa: ERROR: bz1075132: user not found
[root@cloud-qe-18 ~]# ipa user-add bz1075132 --first=f --last=l
----------------------
Added user "bz1075132"
----------------------
User login: bz1075132
First name: f
Last name: l
Full name: f l
Display name: f l
Initials: fl
Home directory: /home/bz1075132
GECOS: f l
Login shell: /bin/sh
Kerberos principal: bz1075132
Email address: bz1075132
UID: 1810200254
GID: 1810200254
Password: False
Member of groups: ipausers
Kerberos keys available: False
[root@cloud-qe-18 ~]# echo -e 'redhat\nredhat' |ipa passwd bz1075132
----------------------------------------------------
Changed password for "bz1075132"
----------------------------------------------------
[root@cloud-qe-18 ~]# echo -e 'redhat\nSecret123\nSecret123'|kinit bz1075132
Password for bz1075132:
Password expired. You must change it now.
Enter new password:
Enter it again:
[root@cloud-qe-18 ~]# BZUID=$(getent passwd bz1075132|cut -f3 -d:)
[root@cloud-qe-18 ~]#
[root@cloud-qe-18 ~]# # create source dir
[root@cloud-qe-18 ~]# rm -rf /tmp/source
[root@cloud-qe-18 ~]# mkdir /tmp/source
[root@cloud-qe-18 ~]# cd /tmp/source
[root@cloud-qe-18 source]# cd /usr/bin
[root@cloud-qe-18 bin]# find . ! -type l -exec cp {} /tmp/source \;
cp: omitting directory ‘.’
[root@cloud-qe-18 bin]# chown -R bz1075132:bz1075132 /tmp/source
[root@cloud-qe-18 bin]#
[root@cloud-qe-18 bin]# # create target dir
[root@cloud-qe-18 bin]# rm -rf /tmp/target
[root@cloud-qe-18 bin]# mkdir /tmp/target
[root@cloud-qe-18 bin]# chcon -t samba_share_t /tmp/target
[root@cloud-qe-18 bin]# chown bz1075132:bz1075132 /tmp/target
[root@cloud-qe-18 bin]#
[root@cloud-qe-18 bin]# # setup samba share
[root@cloud-qe-18 bin]# net conf delshare 'share'
[root@cloud-qe-18 bin]# net conf setparm 'share' 'comment' 'Trust test share'
[root@cloud-qe-18 bin]# net conf setparm 'share' 'read only' 'no'
[root@cloud-qe-18 bin]# net conf setparm 'share' 'path' '/tmp/target'
[root@cloud-qe-18 bin]#
[root@cloud-qe-18 bin]# # setup for logging
[root@cloud-qe-18 bin]# I1=$(hostname -d|sed 's/\./-/g')
[root@cloud-qe-18 bin]# INSTANCE=${I1^^}
[root@cloud-qe-18 bin]# net cache flush
[root@cloud-qe-18 bin]# net conf setparm global 'log level' 10
[root@cloud-qe-18 bin]# for i in /var/log/dirsrv/slapd-$INSTANCE/access /var/log/samba/log.*; do > $i; done
[root@cloud-qe-18 bin]#
[root@cloud-qe-18 bin]# # transfer files
[root@cloud-qe-18 bin]# cd /tmp/source
[root@cloud-qe-18 source]# echo 'Secret123'|kinit bz1075132
Password for bz1075132:
[root@cloud-qe-18 source]# echo -e 'recurse\nprompt n\nmput *'| smbclient -k //$(hostname)/share > /dev/null 2>&1
[root@cloud-qe-18 source]#
[root@cloud-qe-18 source]# # check log counts
[root@cloud-qe-18 source]# sleep 30
[root@cloud-qe-18 source]# FILECOUNT=$(find /tmp/target|wc -l)
[root@cloud-qe-18 source]# FILECOUNT=$(( FILECOUNT += 10 ))
[root@cloud-qe-18 source]# LOGCOUNT1=$(grep ldap_search.*$BZUID /var/log/samba/* 2>/dev/null | wc -l)
[root@cloud-qe-18 source]# LOGCOUNT2=$(grep SRCH.*$BZUID /var/log/dirsrv/slapd-*/access 2>/dev/null | wc -l)
[root@cloud-qe-18 source]#
[root@cloud-qe-18 source]# echo "$FILECOUNT $LOGCOUNT1 $LOGCOUNT2"
1146 2272 2272
[root@cloud-qe-18 source]#
[root@cloud-qe-18 source]# if [ $LOGCOUNT1 -gt $FILECOUNT ]; then
> echo FAIL
> else
> echo PASS
> fi
FAIL
I increase the FILECOUNT check variable by 10 above to give some room for the test in case of unexpected log entries.
Verified.
Version ::
ipa-server-3.3.3-22.el7.x86_64
RESULTS ::
[root@cloud-qe-6 ~]# # add user/group
[root@cloud-qe-6 ~]# service sssd stop; rm -rf /var/lib/sss/{db,mc}/*; service sssd start
Redirecting to /bin/systemctl stop sssd.service
Redirecting to /bin/systemctl start sssd.service
[root@cloud-qe-6 ~]# kdestroy -A ; echo Secret123 | kinit admin
Password for admin:
[root@cloud-qe-6 ~]# ipa user-del bz1075132
ipa: ERROR: bz1075132: user not found
[root@cloud-qe-6 ~]# ipa user-add bz1075132 --first=f --last=l
----------------------
Added user "bz1075132"
----------------------
User login: bz1075132
First name: f
Last name: l
Full name: f l
Display name: f l
Initials: fl
Home directory: /home/bz1075132
GECOS: f l
Login shell: /bin/sh
Kerberos principal: bz1075132
Email address: bz1075132
UID: 597800254
GID: 597800254
Password: False
Member of groups: ipausers
Kerberos keys available: False
[root@cloud-qe-6 ~]# echo -e 'redhat\nredhat' |ipa passwd bz1075132
----------------------------------------------------
Changed password for "bz1075132"
----------------------------------------------------
[root@cloud-qe-6 ~]# echo -e 'redhat\nSecret123\nSecret123'|kinit bz1075132
Password for bz1075132:
Password expired. You must change it now.
Enter new password:
Enter it again:
[root@cloud-qe-6 ~]# BZUID=$(getent passwd bz1075132|cut -f3 -d:)
[root@cloud-qe-6 ~]#
[root@cloud-qe-6 ~]# # create source dir
[root@cloud-qe-6 ~]# rm -rf /tmp/source
[root@cloud-qe-6 ~]# mkdir /tmp/source
[root@cloud-qe-6 ~]# cd /tmp/source
[root@cloud-qe-6 source]# cd /usr/bin
[root@cloud-qe-6 bin]# find . ! -type l -exec cp {} /tmp/source \;
cp: omitting directory ‘.’
[root@cloud-qe-6 bin]# chown -R bz1075132:bz1075132 /tmp/source
[root@cloud-qe-6 bin]#
[root@cloud-qe-6 bin]# # create target dir
[root@cloud-qe-6 bin]# rm -rf /tmp/target
[root@cloud-qe-6 bin]# mkdir /tmp/target
[root@cloud-qe-6 bin]# chcon -t samba_share_t /tmp/target
[root@cloud-qe-6 bin]# chown bz1075132:bz1075132 /tmp/target
[root@cloud-qe-6 bin]#
[root@cloud-qe-6 bin]# # setup samba share
[root@cloud-qe-6 bin]# net conf delshare 'share'
[root@cloud-qe-6 bin]# net conf setparm 'share' 'comment' 'Trust test share'
[root@cloud-qe-6 bin]# net conf setparm 'share' 'read only' 'no'
[root@cloud-qe-6 bin]# net conf setparm 'share' 'path' '/tmp/target'
[root@cloud-qe-6 bin]#
[root@cloud-qe-6 bin]# # setup for logging
[root@cloud-qe-6 bin]# I1=$(hostname -d|sed 's/\./-/g')
[root@cloud-qe-6 bin]# INSTANCE=${I1^^}
[root@cloud-qe-6 bin]# net cache flush
[root@cloud-qe-6 bin]# net conf setparm global 'log level' 10
[root@cloud-qe-6 bin]# for i in /var/log/dirsrv/slapd-$INSTANCE/access /var/log/samba/log.*; do > $i; done
[root@cloud-qe-6 bin]#
[root@cloud-qe-6 bin]# # transfer files
[root@cloud-qe-6 bin]# cd /tmp/source
[root@cloud-qe-6 source]# echo 'Secret123'|kinit bz1075132
Password for bz1075132:
[root@cloud-qe-6 source]# echo -e 'recurse\nprompt n\nmput *'| smbclient -k //$(hostname)/share > /dev/null 2>&1
[root@cloud-qe-6 source]#
[root@cloud-qe-6 source]# # check log counts
[root@cloud-qe-6 source]# sleep 30
[root@cloud-qe-6 source]# FILECOUNT=$(find /tmp/target|wc -l)
[root@cloud-qe-6 source]# FILECOUNT=$(( FILECOUNT += 10 ))
[root@cloud-qe-6 source]# LOGCOUNT1=$(grep ldap_search.*$BZUID /var/log/samba/* 2>/dev/null | wc -l)
[root@cloud-qe-6 source]# LOGCOUNT2=$(grep SRCH.*$BZUID /var/log/dirsrv/slapd-*/access 2>/dev/null | wc -l)
[root@cloud-qe-6 source]#
[root@cloud-qe-6 source]# echo "$FILECOUNT $LOGCOUNT1 $LOGCOUNT2"
1155 1145 1145
[root@cloud-qe-6 source]#
[root@cloud-qe-6 source]# if [ $LOGCOUNT1 -gt $FILECOUNT ]; then
> echo FAIL
> else
> echo PASS
> fi
PASS
And it should be noted that when the same test is run with admin:admins instead of new user:group, results were more dramatic:
Unpatched:
[root@cloud-qe-18 source]# BZUID=$(getent passwd admin|cut -f3 -d:)
[root@cloud-qe-18 source]# LOGCOUNT1=$(grep ldap_search.*$BZUID /var/log/samba/* 2>/dev/null | wc -l)
[root@cloud-qe-18 source]# LOGCOUNT2=$(grep SRCH.*$BZUID /var/log/dirsrv/slapd-*/access 2>/dev/null | wc -l)
[root@cloud-qe-18 source]#
[root@cloud-qe-18 source]# echo "$LOGCOUNT1 $LOGCOUNT2"
2272 2272
[root@cloud-qe-18 source]# find /tmp/target|wc -l
1136
Patched:
[root@cloud-qe-6 source]# BZUID=$(getent passwd admin|cut -f3 -d:)
[root@cloud-qe-6 source]# LOGCOUNT1=$(grep ldap_search.*$BZUID /var/log/samba/* 2>/dev/null | wc -l)
[root@cloud-qe-6 source]# LOGCOUNT2=$(grep SRCH.*$BZUID /var/log/dirsrv/slapd-*/access 2>/dev/null | wc -l)
[root@cloud-qe-6 source]# echo "$LOGCOUNT1 $LOGCOUNT2"
0 0
[root@cloud-qe-6 source]# find /tmp/target|wc -l
1145
This request was resolved in Red Hat Enterprise Linux 7.0. Contact your manager or support representative in case you have further questions about the request. |