Bug 1305608

Summary: 100% CPU utilization when SSH connections reach ~10 per second
Product: Red Hat Enterprise Linux 7 Reporter: Bryan Totty <btotty>
Component: systemdAssignee: systemd-maint
Status: CLOSED ERRATA QA Contact: Frantisek Sumsal <fsumsal>
Severity: high Docs Contact:
Priority: urgent    
Version: 7.2CC: bblaskov, cww, fkrska, fsumsal, hannsj_uhl, jgenus, killythecat, lnykryn, mmatsuya, msekleta, myamazak, pdhamdhe, pdwyer, rmetrich, snishika, ssahani, sshaurya, syamamot, systemd-maint-list, systemd-maint, yoguma, yzheng, zpytela
Target Milestone: rcKeywords: Regression, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: systemd-219-22.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1360159 (view as bug list) Environment:
Last Closed: 2016-11-04 00:51:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1360159    

Description Bryan Totty 2016-02-08 17:14:17 UTC
Description of problem:

1. Please reference- 

https://github.com/systemd/systemd/issues/1961 

for a background, as it appears to affect RHEL 7.2's systemd 219-19.

2. A test application is continuously making ssh connections at a rate on the order of 10/sec to execute command to the hosts that experience this problem. Each of these remotely executed commands is typically very short in execution time. 

3. We do not believe that rate limiting parameters in the SSH configuration file /etc/ssh/sshd_config defined, such as "MaxSessions" or "MaxStartups" are the root cause or are related in this scenario. 

4. The problem seems to be introduced somewhere in systemd 208 - 219. The exact same application test does not cause the below problems with systemd 208 or older.

5. When the host receiving the the ssh commands is operating well, the systemd-logind will typically look as follows, here you can see that new sessions are being created and removed soon there after.

[root@hostname ~]# journalctl -u systemd-logind -f
-- Logs begin at Mon 2016-02-01 20:36:09 EST. --
Feb 04 15:35:12 hostname.example.com systemd-logind[571]: New session 759045 of user root.
Feb 04 15:35:12 hostname.example.com systemd-logind[571]: New session 759046 of user root.
Feb 04 15:35:12 hostname.example.com systemd-logind[571]: New session 759048 of user root.
Feb 04 15:35:12 hostname.example.com systemd-logind[571]: New session 759047 of user root.
Feb 04 15:35:12 hostname.example.com systemd-logind[571]: New session 759049 of user root.
Feb 04 15:35:12 hostname.example.com systemd-logind[571]: Removed session 759045.
Feb 04 15:35:12 hostname.example.com systemd-logind[571]: Removed session 759047.
Feb 04 15:35:12 hostname.example.com systemd-logind[571]: Removed session 759046.
Feb 04 15:35:12 hostname.example.com systemd-logind[571]: Removed session 759048.
Feb 04 15:35:12 hostname.example.com systemd-logind[571]: Removed session 759049.
Feb 04 15:35:13 hostname.example.com systemd-logind[571]: New session 759050 of user root.
Feb 04 15:35:13 hostname.example.com systemd-logind[571]: Removed session 759050.
Feb 04 15:35:13 hostname.example.com systemd-logind[571]: New session 759051 of user root.
Feb 04 15:35:13 hostname.example.com systemd-logind[571]: Removed session 759051.
Feb 04 15:35:14 hostname.example.com systemd-logind[571]: New session 759054 of user root.
Feb 04 15:35:14 hostname.example.com systemd-logind[571]: New session 759056 of user root.
Feb 04 15:35:14 hostname.example.com systemd-logind[571]: New session 759053 of user root.
Feb 04 15:35:14 hostname.example.com systemd-logind[571]: New session 759052 of user root.
Feb 04 15:35:14 hostname.example.com systemd-logind[571]: New session 759055 of user root.
Feb 04 15:35:14 hostname.example.com systemd-logind[571]: New session 759057 of user root.
Feb 04 15:35:14 hostname.example.com systemd-logind[571]: New session 759058 of user root.
Feb 04 15:35:14 hostname.example.com systemd-logind[571]: Removed session 759054.
Feb 04 15:35:14 hostname.example.com systemd-logind[571]: Removed session 759053.
Feb 04 15:35:14 hostname.example.com systemd-logind[571]: Removed session 759057.
Feb 04 15:35:14 hostname.example.com systemd-logind[571]: Removed session 759055.
Feb 04 15:35:14 hostname.example.com systemd-logind[571]: Removed session 759052.
Feb 04 15:35:14 hostname.example.com systemd-logind[571]: Removed session 759056.
Feb 04 15:35:14 hostname.example.com systemd-logind[571]: Removed session 759058.

When a host running systemd 219 experience the problem (usually some time after 10's of minutes to a couple of hours of the above loading,) the "Remove session" logs will cease and only "New session" logs will be seen.

At this time, or some time soon there after, the systemd-logind process will begin utilizing near 100% CPU and can hang the system. 

6. One indication of the problem is a large number (10s-100s) of logs indicating abandon sessions as seen with the following command: systemctl |grep "of user root" |grep "abandoned"

7. Another indication of this problem is a large and increasing number files (100s-1000s) indicated via "ls /run/systemd/sessions/ | wc -l". 


Version-Release number of selected component (if applicable):

This issue has been narrowed down to having been introduced some time between systemd version 208 (where the problem is not seen and version 219 where the problem is seen.)

Further the version of openssh-server is the same in both in good and bad hosts.

openssh-6.6.1p1-23.el7_2.x86_64
openssh-clients-6.6.1p1-23.el7_2.x86_64
openssh-server-6.6.1p1-23.el7_2.x86_64


How reproducible:
Always after 10 minutes - a few hours.


Steps to Reproduce:
1. Create a script that opens and closes SSH connections at a rate of about 10 per second. Keep in mind the "MaxSessions" or "MaxStartups" parameters mentioned above as they could play a role if the rate or amount of connections tested exceeds what is allowed by SSHD. If the sshd_config limits are exceeded in your configuration file, then the test could be invalidated/masked.

2. Wait for a period of time and monitor with "journalctl -u systemd-logind -f"


Actual results:
100% CPU utilization and possible system hangs.


Expected results:
Graceful handling of the connections.

Comment 2 Lukáš Nykrýn 2016-02-10 12:36:57 UTC
We are working on this, but there is no known solution -> condnack_patch

Comment 9 Lukáš Nykrýn 2016-05-05 12:32:21 UTC
So there was some progress regarding this bug.
There are two sets of patches in upstream for this issue and I am currently working on their backport for rhel7. Unfortunately the changes are really huge, so it will take some time.

Also there might be additional bug in dbus https://bugzilla.redhat.com/show_bug.cgi?id=1333349

Comment 11 Lukáš Nykrýn 2016-05-10 12:25:54 UTC
So I think we have a patch for this: https://github.com/lnykryn/systemd-rhel/pull/19

I have made a test build https://people.redhat.com/lnykryn/systemd/bz1305608/
but please note that it is highly experimental, untested, not supported, ...

Can you please try it?

Comment 12 Branislav Blaškovič 2016-05-16 08:38:49 UTC
qa_acking.

Comment 36 errata-xmlrpc 2016-11-04 00:51:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2216.html