1688848 – vsftpd sessions enabled causes continual growth of utmp and high system cpu%

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1688848 - vsftpd sessions enabled causes continual growth of utmp and high system cpu%

Summary: vsftpd sessions enabled causes continual growth of utmp and high system cpu%

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	vsftpd
Sub Component:
Version:	7.6
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Ondřej Lysoněk
QA Contact:	Ondrej Mejzlik
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1688852 1709724 1716960 1737433
TreeView+	depends on / blocked

Reported:	2019-03-14 15:11 UTC by Welterlen Benoit
Modified:	2023-03-24 14:38 UTC (History)
CC List:	10 users (show)
Fixed In Version:	vsftpd-3.0.2-27.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1688852 1737433 (view as bug list)
Environment:
Last Closed:	2020-03-31 19:12:14 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
patch (721 bytes, patch) 2019-03-14 15:11 UTC, Welterlen Benoit	no flags	Details \| Diff
Fix SRPM (266.50 KB, application/x-rpm) 2019-03-14 15:19 UTC, Welterlen Benoit	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:1010	0	None	None	None	2020-03-31 19:12:27 UTC

Description Welterlen Benoit 2019-03-14 15:11:23 UTC

Created attachment 1544113 [details]
patch

Description of problem:

When session_support=YES in vsftpd-3.0.2-25.el7.x86_64, as the connections come in, /var/run/utmp will grow indefinitely and the number of users logged into the system (via top) will continue to grow even though the number of connections still active are not changing.

If you execute `last -f /var/run/utmp` you can see thousands (depending on how long the system has been running) of users listed as "gone - no logout" .  

 last -f /var/run/utmp
...
user3    vsftpd:2244  192.168.122.1    Thu Mar  7 12:37    gone - no logout 
user2    vsftpd:2245  192.168.122.1    Thu Mar  7 12:37    gone - no logout 
user2    vsftpd:1878  192.168.122.1    Thu Mar  7 12:37    gone - no logout 
user6    vsftpd:1758  192.168.122.1    Thu Mar  7 12:37    gone - no logout 
user2    vsftpd:1701  192.168.122.1    Thu Mar  7 12:37    gone - no logout
...

From the utmpdump:
[7] [02245] [    ] [user2   ] [vsftpd:2245 ] [192.168.122.1       ] [0.0.0.0        ] [Thu Mar 07 12:37:23 2019 EST]
we can see that the entry was not updated during the logout.
We also see some logout entries that do not correspond to login entries.

Version-Release number of selected component (if applicable):
- RHEL 7
- vsftpd-3.0.2-25

How reproducible:
With a lot of simultaneous connections

Steps to Reproduce:
1. enable session in vsftpd.conf session_support=YES
2. remove the limitation per ip to reproduce the issue: max_per_ip=0
3. run the server and lot of clients:

#!/bin/bash

function ftpcommand () {
    HOST='192.168.122.155'
    PASSWD='user'
    for user in user1 user2 user3 user4 user5 user6; do
       ftp -inv $HOST << EOF
       quote USER $user
       quote PASS $PASSWD
       bye
EOF
    done
}
while true; do
    for i in {1..10}; do
        ftpcommand &
        #sleep 0.1
    done
    sleep .5
done


Actual results:
Entries in utmp broken, then the file size increases and performances fall

Expected results:
Clean utmp file

Additional info:
The issue is in the utmp management: sysdeputil.c.
The return code of pututxline is not used, then if it fails, the vsf_remove_uwtmp will add an entry instead of removing it.

One solution is to loop until it's successful (with maybe a max) or update s_uwtmp_inserted:

--- 1/sysdeputil.c      2019-03-14 07:11:33.404248709 -0400
+++ 2/sysdeputil.c      2019-03-14 07:14:26.947135914 -0400
@@ -1233,7 +1233,8 @@
                      sizeof(s_utent.ut_host));
   s_utent.ut_tv.tv_sec = vsf_sysutil_get_time_sec();
   setutxent();
-  (void) pututxline(&s_utent);
+  while (pututxline(&s_utent) == NULL)
+    setutxent();
   endutxent();
   updwtmpx(WTMPX_FILE, &s_utent);
 }
@@ -1251,7 +1252,8 @@
   vsf_sysutil_memclr(s_utent.ut_host, sizeof(s_utent.ut_host));
   s_utent.ut_tv.tv_sec = 0;
   setutxent();
-  (void) pututxline(&s_utent);
+  while (pututxline(&s_utent) == NULL)
+    setutxent();
   endutxent();
   s_utent.ut_tv.tv_sec = vsf_sysutil_get_time_sec();
   updwtmpx(WTMPX_FILE, &s_utent);

Comment 2 Welterlen Benoit 2019-03-14 15:19:07 UTC

Created attachment 1544114 [details]
Fix SRPM

Comment 3 Ondřej Lysoněk 2019-03-15 14:20:44 UTC

Thanks for the report.

Do you have some evidence that the pututxline() calls in fact fail? And if they do fail, what is the errno?

I was able to reproduce the 'gone - no logout' lines, however they disappear after some event (perhaps after the corresponding vsftpd processes terminate). And even though I got the 'gone - no logout' lines, none of the pututxline() calls in fact failed.

Comment 4 Welterlen Benoit 2019-03-15 14:26:29 UTC

Hello,

Yes, I was also in the situation where 'gone - no logout' disappear. But the issue is that sometimes, the entry stays in the utmp file, because the call does not success.

I made a ugly version where I print a message when the return is null:

  while (pututxline(&s_utent) == NULL) {
        str_alloc_text(&utmp_str, "pututxline return NULL, retrying \n");
        str_syslog(&utmp_str,0);
        setutxent();
}

And a previous version where I don't retry, and I can see the message when the utmp is broken.

Comment 13 Ondřej Lysoněk 2019-07-31 12:40:10 UTC

This bug turned out to be much more tricky than I thought. It turns
out that the source of the signals that make pututxline() fail with
EINTR is pututxline() itself - pututxline() does locking of the utmp
file and uses alarm() to set a timeout on the locking operation. So
what we observe as pututxline() failing with EINTR is in fact
pututxline() timing out while attempting to lock a file.

Now if I use the fix that repeats both setutxent() and pututxline()
when pututxline() fails with EINTR, things can get ugly. This fix
causes a situation to arise where many vsftpd processes keep re-trying
setutxent() and pututxline() and none of them succeed.

You could think that this cannot happen - if one process times out on
a lock operation, it must mean that some other process is currently
holding the lock and that process is moving forward. Unfortunatelly,
this may not always be the case - the process that holds the lock may
not be moving forward at all.

The problem is that pututxline() locks the utmp file twice in a row
and each of the two lock operations can time out. So it can happen
that while the first locking succeeds, the second locking times out
and consequently the whole pututxline() call fails. And if this
happens for every process that is attempting to call setutxent() and
pututxline(), none of the processes will move forward.  They will keep
looping, potentially forever, or at least for a very long time. In my
testing, this situation arises fairly often.

The solution to this problem is to make pututxline() remember that a
previous call to pututxline has already done the first locking,
remember that it has already done the work that needs to be done while
that lock is held, and to reuse the result of that work.

This could, in theory be achieved by repeating only pututxline() - not
both setutxent() and pututxline(). However there is currently a glibc
bug that prevents us from doing that - I'll file the bug shortly.

Once the glibc bug is fixed, we should be able to safely keep
repeating pututxline() until it succeeds and we can properly fix this
vsftpd bug.

But until the glibc bug is fixed, there's not much I can do in
vsftpd. I can implement a partial fix, where the
setutxline()-pututxline() combo is repeated only a couple of times and
give up after the retries are exhausted. This may help a bit, but it's
not a complete fix.

Comment 14 Ondřej Lysoněk 2019-07-31 12:58:41 UTC

(In reply to Welterlen Benoit from comment #0)
> Actual results:
> Entries in utmp broken, then the file size increases and performances fall

I'd say it's in fact normal that the file keeps growing and
subsequently the utmp file processing takes longer. I think the
performance could be improved however by changing pututxline() to lock
only the relevant portion of the file for writing - currently it locks
the whole file, which increases contention. I think locking the whole
file is unnecessary. I'll file a bug for this.

Comment 15 Ondřej Lysoněk 2019-07-31 12:59:33 UTC

(In reply to Ondřej Lysoněk from comment #13)
> This could, in theory be achieved by repeating only pututxline() - not
> both setutxent() and pututxline(). However there is currently a glibc
> bug that prevents us from doing that - I'll file the bug shortly.

I've filed bug#1734791 for this.

Comment 16 Ondřej Lysoněk 2019-08-01 11:45:28 UTC

(In reply to Ondřej Lysoněk from comment #13)
> But until the glibc bug is fixed, there's not much I can do in
> vsftpd. I can implement a partial fix, where the
> setutxline()-pututxline() combo is repeated only a couple of times and
> give up after the retries are exhausted. This may help a bit, but it's
> not a complete fix.

TL;DR:
I ran some experiments to determine if repeating
setutxline()-pututxline() is worth it. From the results of the
experiment it seems that repeating it once improves the situation
significantly, however anything more than that does not seem to be
worth it.

Long version:
The experiment was to run the following 'for' loop twice in sequence:
for i in {1..10000}; do echo -en "user user$((i%51))\\npass user$((i%51))\\nquit\\n" | nc localhost 21 & done > log; wait;

I measured the time the second iteration has taken and counted the
number of bad utmp entries when the test finished.

I ran the experiment with unpatched vsftpd and several versions of
patched vsftpd - the patches differed in the number of
setutxent()-pututxline() attempts.

The results follow.

Test with unpatched vsftpd (number of utmp insert/remove attempts: 1)
- Time taken: 4m6.468s
- Number of bad utmp entries: 7192

Test with patched vsftpd, number of utmp insert/remove attempts: 2
- Time taken: 4m39.467s
- Number of bad utmp entries: 1321

Test with patched vsftpd, number of utmp insert/remove attempts: 3
- Time taken: 5m0.778s
- Number of bad utmp entries: 1300

Test with patched vsftpd, number of utmp insert/remove attempts: 8
Time taken: 6m22s
Number of bad utmp entries: 1003

From these results we can see that raising the number of insert/remove
attempts to 2 reduces the number of bad utmp entries
significantly. Incresing the number of attempts further does
not significantly affect the number of bad utmp entries, whereas the
execution time grows approximately linearly.

Comment 18 Ondřej Lysoněk 2019-08-05 11:23:05 UTC

Clearing the dependency on glibc bug#1734791. We'll use this bug to do the partial fix. I've cloned this bug to bug#1737433 - we'll use that to fully resolve the problem after the glibc bug is fixed.

Comment 30 errata-xmlrpc 2020-03-31 19:12:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1010

Note You need to log in before you can comment on or make changes to this bug.