Created attachment 1127487 [details]
Output of "sudo journalctl" and "systemctl -t service"
Description of problem:
On a live web server running CentOS 7.2, the systemd (PID 1) process leaks roughly 200 MB of memory per day; it is currently at 3.7 GB of RAM after 18 days of uptime. The server must be rebooted periodically to free the memory.
Version-Release number of selected component (if applicable):
systemd version 219
How reproducible:
Reproducible on this particular server by simply rebooting and watching RAM usage grow over time.
Actual results:
RAM usage of the PID 1 process increases by ~200 MB per day.
Expected results:
RAM usage should not increase.
Additional info:
The only heavy activity in the logs shown by "sudo journalctl" is related to numerous rsync SSH connections made by another production server. I've attached a sample of the journal log with real hostnames and IP addresses redacted. I've also attached the output of "systemctl -t service".
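For reference, the growth can be tracked by sampling PID 1's resident set size; this is a minimal sketch of how I watch it (the output format is my own choice, not part of the report — run it from cron or a loop to chart the trend over days):

```shell
#!/bin/sh
# Print one timestamped sample of PID 1's resident set size (kB),
# read from the kernel's VmRSS field in /proc/1/status.
rss_kb=$(awk '/^VmRSS:/ {print $2}' /proc/1/status)
printf '%s PID1 RSS: %s kB\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$rss_kb"
```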
Sounds like https://github.com/systemd/systemd/issues/1961
(In reply to Lukáš Nykrýn from comment #2)
> Sounds like https://github.com/systemd/systemd/issues/1961
Well, not quite. The CPU is not pegged at 100% here, and running "systemctl list-unit-files" lists only ~60 session-*.scope units. I also see no logind failures in "sudo journalctl -b -u systemd-logind" as in that issue.
There are 86 scope files and associated directories in /run/systemd/system/ on this server, amounting to ~20 MB of disk space. Many of these files are up to 6 days old; is this normal? The server has been up for 19 days, so if this were the source of the leak I would expect to see orphaned files as old as 19 days as well.
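For anyone wanting to repeat the check, this is roughly how the leftover transient session scopes can be counted (the one-day age threshold is an arbitrary choice of mine; the directory may not exist on hosts not booted with systemd, hence the guards):

```shell
#!/bin/sh
# Count session-*.scope files (and their .scope.d directories) under
# /run/systemd/system and report how many are older than one day.
total=$(find /run/systemd/system -maxdepth 1 -name 'session-*.scope*' 2>/dev/null | wc -l)
stale=$(find /run/systemd/system -maxdepth 1 -name 'session-*.scope*' -mtime +1 2>/dev/null | wc -l)
echo "session scope entries: $total (older than 1 day: $stale)"
```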
Here is some output from systemd-cgtop showing resource usage of each active control group. Note that the problem is only showing up in the "root" path.
>Path                                        Tasks  %CPU  Memory  Input/s  Output/s
>/                                             296  30.5   11.3G   657.8K    893.0K
>/system.slice/NetworkManager.service            1     -       -        -         -
>/system.slice/auditd.service                    1     -       -        -         -
>/system.slice/crond.service                     1     -       -        -         -
>/system.slice/dbus.service                      1     -       -        -         -
>/system.slice/irqbalance.service                1     -       -        -         -
>/system.slice/lvm2-lvmetad.service              1     -       -        -         -
>/system.slice/mariadb.service                   2     -       -        -         -
>/system.slice/nginx.service                    10     -       -        -         -
>/system.slice/php-fpm.service                 101     -       -        -         -
>/system.slice/polkit.service                    1     -       -        -         -
>/system.slice/postfix.service                   3     -       -        -         -
>/system.slice/rsyslog.service                   1     -       -        -         -
>/system.slice/smartd.service                    1     -       -        -         -
>/system.slice/sshd.service                      2     -       -        -         -
>firstname.lastname@example.org                  1     -       -        -         -
>/system.slice/systemd-journald.service          1     -       -        -         -
>/system.slice/systemd-logind.service            1     -       -        -         -
>/system.slice/systemd-udevd.service             1     -       -        -         -
>/system.slice/tuned.service                     1     -       -        -         -
>/system.slice/wpa_supplicant.service            1     -       -        -         -
>/user.slice/user-1000.slice/session-7170741.scope  4  -     -        -         -
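A snapshot like the one above can be captured non-interactively; a sketch using systemd-cgtop's batch options (flag spellings per systemd-cgtop(1); the fallback message is mine, in case the tool is absent):

```shell
#!/bin/sh
# One-shot, non-interactive control-group snapshot ordered by memory use:
# -b batch mode, -n 1 single iteration, -m order by memory.
systemd-cgtop -b -n 1 -m 2>/dev/null || echo "systemd-cgtop not available here"
```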
If you run systemctl daemon-reexec does it decrease the amount of allocated memory?
Can you also attach output of systemd-analyze dump?
Created attachment 1127642 [details]
Output of "systemd-analyze dump"
(In reply to Lukáš Nykrýn from comment #5)
> If you run systemctl daemon-reexec does it decrease the amount of allocated
> memory?
> Can you also attach output of systemd-analyze dump?
Running systemctl daemon-reexec does release all of the used RAM; the question is whether the leak will continue, since it has persisted through reboots before. Does the output of this command provide any insight into the cause of the leak?
I've attached the output of "systemd-analyze dump" taken prior to issuing the daemon-reexec command.
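For anyone else measuring the effect, a sketch of the before/after check (requires root; the fallback message is mine — daemon-reexec serializes the manager's state, re-executes the systemd binary, and deserializes, which drops the leaked heap allocations):

```shell
#!/bin/sh
# Compare PID 1's resident set size before and after re-executing the manager.
before=$(awk '/^VmRSS:/ {print $2}' /proc/1/status)
systemctl daemon-reexec 2>/dev/null || echo "daemon-reexec failed (not root, or not booted with systemd)"
after=$(awk '/^VmRSS:/ {print $2}' /proc/1/status)
echo "PID 1 RSS: ${before} kB -> ${after} kB"
```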
Created attachment 1127656 [details]
Output of atop during server high load
Here's another example of abnormal systemd behavior.
I've attached the output of atop from a period when the server was under high load due to production tasks. These tasks involve downloading, reading, and writing lots of data on the /home partition. However, a large fraction of the disk I/O is taking place on the root partition (LVM centos-root on the left-hand side of the output), which should not be the case, and atop shows systemd responsible for the majority of that disk usage, presumably on the root partition.
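To quantify how much of that I/O is attributable to PID 1 itself, the kernel's cumulative per-process counters can be sampled; a rough sketch (the 5-second window is an arbitrary choice; /proc/1/io is readable only by root, hence the fallback):

```shell
#!/bin/sh
# Estimate PID 1's average write rate over a short window from its
# cumulative write_bytes counter in /proc/1/io.
w1=$(awk '/^write_bytes:/ {print $2}' /proc/1/io 2>/dev/null)
sleep 5
w2=$(awk '/^write_bytes:/ {print $2}' /proc/1/io 2>/dev/null)
if [ -n "$w1" ] && [ -n "$w2" ]; then
    echo "PID 1 write rate: $(( (w2 - w1) / 5 )) B/s"
else
    echo "cannot read /proc/1/io (need root)"
fi
```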
Would you be willing to try a test build? We found one memory-leak.
(In reply to Lukáš Nykrýn from comment #10)
> Would you be willing to try a test build? We found one memory-leak.
Well, this is a live web server, so I'm a little wary. Is the leak you found capable of leaking 200 MB/day, as I have observed?
I am sorry, but I don't know. I will try to find an artificial reproducer and test the fix myself.
I am observing this memory leak on my Ubuntu Xenial server. I'm willing to give you whatever information you want and to try whatever fix you have.
This problem should be fixed in systemd-219-30. If anyone is willing to try that, we have a repo with test builds here: https://copr.fedorainfracloud.org/coprs/lnykryn/systemd-rhel-staging/
(In reply to Lukáš Nykrýn from comment #14)
> This problem should be fixed in systemd-219-30. If anyone is willing to try
> that, we have a repo with test builds here:
We are experiencing this problem on a production server. Does 219-30 fix it? I see it's available in the RHEL 7.3 beta.
If I am not mistaken, 7.3 should be out now, so you can try the latest version there.
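For CentOS users following along, the upgrade path would look roughly like this (a sketch, assuming a yum-based CentOS 7 host; the update commands are left commented since they should only run once the 7.3 packages reach the mirrors):

```shell
#!/bin/sh
# Check which systemd build is installed; the fix is said to be in 219-30.
rpm -q systemd 2>/dev/null || echo "rpm not available on this host"
# Once systemd >= 219-30 is published in the base repos:
#   sudo yum clean expire-cache && sudo yum update systemd
#   sudo systemctl daemon-reexec   # swap in the new binary without a reboot
```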
(In reply to Lukáš Nykrýn from comment #16)
> If I am not mistaken 7.3 should be out now. So you can try the latest
> version there.
Indeed. I will give an update here as soon as it shows up in the CentOS repository.
Created attachment 1225281 [details]