Bug 1688848
Summary: | vsftpd sessions enabled causes continual growth of utmp and high system cpu% | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Welterlen Benoit <bwelterl> | ||||||
Component: | vsftpd | Assignee: | Ondřej Lysoněk <olysonek> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Ondrej Mejzlik <omejzlik> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | unspecified | ||||||||
Version: | 7.6 | CC: | ashankar, codonell, dj, fweimer, mnewsome, olysonek, omejzlik, pfrankli, rmetrich, thozza | ||||||
Target Milestone: | rc | Keywords: | Reproducer, TestCaseNotNeeded | ||||||
Target Release: | --- | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | vsftpd-3.0.2-27.el7 | Doc Type: | If docs needed, set a value | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | |||||||||
: | 1688852 1737433 (view as bug list) | Environment: | |||||||
Last Closed: | 2020-03-31 19:12:14 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 1688852, 1709724, 1716960, 1737433 | ||||||||
Attachments: |
|
Created attachment 1544114 [details]
Fix SRPM
Thanks for the report. Do you have some evidence that the pututxline() calls in fact fail? And if they do fail, what is the errno? I was able to reproduce the 'gone - no logout' lines, however they disappear after some event (perhaps after the corresponding vsftpd processes terminate). And even though I got the 'gone - no logout' lines, none of the pututxline() calls in fact failed. Hello, Yes, I was also in the situation where 'gone - no logout' disappear. But the issue is that sometimes, the entry stays in the utmp file, because the call does not success. I made a ugly version where I print a message when the return is null: while (pututxline(&s_utent) == NULL) { str_alloc_text(&utmp_str, "pututxline return NULL, retrying \n"); str_syslog(&utmp_str,0); setutxent(); } And a previous version where I don't retry, and I can see the message when the utmp is broken. This bug turned out to be much more tricky than I thought. It turns out that the source of the signals that make pututxline() fail with EINTR is pututxline() itself - pututxline() does locking of the utmp file and uses alarm() to set a timeout on the locking operation. So what we observe as pututxline() failing with EINTR is in fact pututxline() timing out while attempting to lock a file. Now if I use the fix that repeats both setutxent() and pututxline() when pututxline() fails with EINTR, things can get ugly. This fix causes a situation to arise where many vsftpd processes keep re-trying setutxent() and pututxline() and none of them succeed. You could think that this cannot happen - if one process times out on a lock operation, it must mean that some other process is currently holding the lock and that process is moving forward. Unfortunatelly, this may not always be the case - the process that holds the lock may not be moving forward at all. The problem is that pututxline() locks the utmp file twice in a row and each of the two lock operations can time out. So it can happen that while the first locking succeeds, the second locking times out and consequently the whole pututxline() call fails. And if this happens for every process that is attempting to call setutxent() and pututxline(), none of the processes will move forward. They will keep looping, potentially forever, or at least for a very long time. In my testing, this situation arises fairly often. The solution to this problem is to make pututxline() remember that a previous call to pututxline has already done the first locking, remember that it has already done the work that needs to be done while that lock is held, and to reuse the result of that work. This could, in theory be achieved by repeating only pututxline() - not both setutxent() and pututxline(). However there is currently a glibc bug that prevents us from doing that - I'll file the bug shortly. Once the glibc bug is fixed, we should be able to safely keep repeating pututxline() until it succeeds and we can properly fix this vsftpd bug. But until the glibc bug is fixed, there's not much I can do in vsftpd. I can implement a partial fix, where the setutxline()-pututxline() combo is repeated only a couple of times and give up after the retries are exhausted. This may help a bit, but it's not a complete fix. (In reply to Welterlen Benoit from comment #0) > Actual results: > Entries in utmp broken, then the file size increases and performances fall I'd say it's in fact normal that the file keeps growing and subsequently the utmp file processing takes longer. I think the performance could be improved however by changing pututxline() to lock only the relevant portion of the file for writing - currently it locks the whole file, which increases contention. I think locking the whole file is unnecessary. I'll file a bug for this. (In reply to Ondřej Lysoněk from comment #13) > This could, in theory be achieved by repeating only pututxline() - not > both setutxent() and pututxline(). However there is currently a glibc > bug that prevents us from doing that - I'll file the bug shortly. I've filed bug#1734791 for this. (In reply to Ondřej Lysoněk from comment #13) > But until the glibc bug is fixed, there's not much I can do in > vsftpd. I can implement a partial fix, where the > setutxline()-pututxline() combo is repeated only a couple of times and > give up after the retries are exhausted. This may help a bit, but it's > not a complete fix. TL;DR: I ran some experiments to determine if repeating setutxline()-pututxline() is worth it. From the results of the experiment it seems that repeating it once improves the situation significantly, however anything more than that does not seem to be worth it. Long version: The experiment was to run the following 'for' loop twice in sequence: for i in {1..10000}; do echo -en "user user$((i%51))\\npass user$((i%51))\\nquit\\n" | nc localhost 21 & done > log; wait; I measured the time the second iteration has taken and counted the number of bad utmp entries when the test finished. I ran the experiment with unpatched vsftpd and several versions of patched vsftpd - the patches differed in the number of setutxent()-pututxline() attempts. The results follow. Test with unpatched vsftpd (number of utmp insert/remove attempts: 1) - Time taken: 4m6.468s - Number of bad utmp entries: 7192 Test with patched vsftpd, number of utmp insert/remove attempts: 2 - Time taken: 4m39.467s - Number of bad utmp entries: 1321 Test with patched vsftpd, number of utmp insert/remove attempts: 3 - Time taken: 5m0.778s - Number of bad utmp entries: 1300 Test with patched vsftpd, number of utmp insert/remove attempts: 8 Time taken: 6m22s Number of bad utmp entries: 1003 From these results we can see that raising the number of insert/remove attempts to 2 reduces the number of bad utmp entries significantly. Incresing the number of attempts further does not significantly affect the number of bad utmp entries, whereas the execution time grows approximately linearly. Clearing the dependency on glibc bug#1734791. We'll use this bug to do the partial fix. I've cloned this bug to bug#1737433 - we'll use that to fully resolve the problem after the glibc bug is fixed. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1010 |
Created attachment 1544113 [details] patch Description of problem: When session_support=YES in vsftpd-3.0.2-25.el7.x86_64, as the connections come in, /var/run/utmp will grow indefinitely and the number of users logged into the system (via top) will continue to grow even though the number of connections still active are not changing. If you execute `last -f /var/run/utmp` you can see thousands (depending on how long the system has been running) of users listed as "gone - no logout" . last -f /var/run/utmp ... user3 vsftpd:2244 192.168.122.1 Thu Mar 7 12:37 gone - no logout user2 vsftpd:2245 192.168.122.1 Thu Mar 7 12:37 gone - no logout user2 vsftpd:1878 192.168.122.1 Thu Mar 7 12:37 gone - no logout user6 vsftpd:1758 192.168.122.1 Thu Mar 7 12:37 gone - no logout user2 vsftpd:1701 192.168.122.1 Thu Mar 7 12:37 gone - no logout ... From the utmpdump: [7] [02245] [ ] [user2 ] [vsftpd:2245 ] [192.168.122.1 ] [0.0.0.0 ] [Thu Mar 07 12:37:23 2019 EST] we can see that the entry was not updated during the logout. We also see some logout entries that do not correspond to login entries. Version-Release number of selected component (if applicable): - RHEL 7 - vsftpd-3.0.2-25 How reproducible: With a lot of simultaneous connections Steps to Reproduce: 1. enable session in vsftpd.conf session_support=YES 2. remove the limitation per ip to reproduce the issue: max_per_ip=0 3. run the server and lot of clients: #!/bin/bash function ftpcommand () { HOST='192.168.122.155' PASSWD='user' for user in user1 user2 user3 user4 user5 user6; do ftp -inv $HOST << EOF quote USER $user quote PASS $PASSWD bye EOF done } while true; do for i in {1..10}; do ftpcommand & #sleep 0.1 done sleep .5 done Actual results: Entries in utmp broken, then the file size increases and performances fall Expected results: Clean utmp file Additional info: The issue is in the utmp management: sysdeputil.c. The return code of pututxline is not used, then if it fails, the vsf_remove_uwtmp will add an entry instead of removing it. One solution is to loop until it's successful (with maybe a max) or update s_uwtmp_inserted: --- 1/sysdeputil.c 2019-03-14 07:11:33.404248709 -0400 +++ 2/sysdeputil.c 2019-03-14 07:14:26.947135914 -0400 @@ -1233,7 +1233,8 @@ sizeof(s_utent.ut_host)); s_utent.ut_tv.tv_sec = vsf_sysutil_get_time_sec(); setutxent(); - (void) pututxline(&s_utent); + while (pututxline(&s_utent) == NULL) + setutxent(); endutxent(); updwtmpx(WTMPX_FILE, &s_utent); } @@ -1251,7 +1252,8 @@ vsf_sysutil_memclr(s_utent.ut_host, sizeof(s_utent.ut_host)); s_utent.ut_tv.tv_sec = 0; setutxent(); - (void) pututxline(&s_utent); + while (pututxline(&s_utent) == NULL) + setutxent(); endutxent(); s_utent.ut_tv.tv_sec = vsf_sysutil_get_time_sec(); updwtmpx(WTMPX_FILE, &s_utent);