1705641 – Random processes consuming inordinate amount of RAM

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	libtirpc
Sub Component:
Version:	30
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Steve Dickson
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (5):	1714117 1716382 1724086 1746844 1764811
Depends On:
Blocks:

Reported:	2019-05-02 16:12 UTC by Leif Hedstrom
Modified:	2020-05-26 18:42 UTC
CC List:	21 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2020-05-26 18:42:16 UTC
Type:	Bug
Embargoed:
Dependent Products:

Description Leif Hedstrom 2019-05-02 16:12:34 UTC

I don't know what this does, or where it comes from, but after upgrading to F30 on a few boxes, I'm seeing this in top:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
13506 root      20   0   52.2g  30.7g     48 R  46.0  98.0   0:22.19 (ie_check)


Killing this seems to help, and it does not come back. I rebooted, and then a different process started acting up:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 7320 root      20   0   52.1g  30.7g     48 R  51.5  97.9   0:09.44 (resolved)

and then again

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 7344 root      20   0   52.1g  30.2g   4708 R 100.0  96.2   0:08.85 (distccd)


Doing a "systemctl restart distcc" ends up with the same problem in distccd, but somehow I don't think it's related to distcc itself (but not sure). Hence, marking it for systemd right now, it's the only thing I can think of being related to such randomness.


The box is generally unhappy at this point, until the rogue process(es) are killed, since well, it's out of memory, and swapping feverishly.

Comment 1 Leif Hedstrom 2019-05-02 16:18:26 UTC

Hmmm, I take that back... Maybe it is distcc related: I disabled it (systemctl disable distccd), and things look happier.

Comment 2 Leif Hedstrom 2019-05-02 16:20:59 UTC

Gah, I'm wrong, another machine now messed up even with distccd disabled. This one hangs with "(resolved)" using gobs of memory.

Comment 3 Leif Hedstrom 2019-05-02 19:23:05 UTC

systemctl disable systemd-resolved 

seems to help ... No idea what that breaks though.

Comment 4 Leif Hedstrom 2019-05-02 19:30:38 UTC

I think, but am not 100% certain, that the problems start when this runs (from syslog):

May  2 13:29:27 milou systemd[1]: Started Check PMIE instances are running.

Comment 5 Leif Hedstrom 2019-05-02 19:46:32 UTC

Killing PCP in addition to killing systemd-resolved seems to help a fair amount.

Comment 6 m 2019-05-09 15:26:32 UTC

I am having a similar issue, that I assume is related. My entries in top look very similar to those of the original poster.

It's a major problem since any time a user logs in (ssh or locally), or a cronjob starts, the system eats up all RAM and grinds to a halt for 20+ seconds or so.

First saw it on an old system that had been upgraded quite a few times.  Then did a fresh install and the problem was still there.  It did not become an issue until I enabled NIS logins.  It seems to be related to NIS/PAM/systemd.

Background:
I had to follow directions from comment 16 (https://bugzilla.redhat.com/show_bug.cgi?id=1575297) to get NIS to work. 

Symptoms:

###
 1. On login, some process (usually just listed as systemd) eats up all 64GB of RAM (within about 5-10 seconds) and most of my swap until it crashes.  Then all is mostly fine.

  - I have seen geoclue, systemd-resolved, colord, and leshootd all reported by top as the culprit as well as the process reported by oom-killer.
  - Most of the time the culprit is simply listed as systemd.
  - While the process is happening, systemd-cgtop reports that the memory/cpu usage is within init.scope
  - This also occurs for cronjobs.

###
2. User manager fails to launch:
--- systemctl output:
● user                                                                         loaded failed failed    User Manager for UID 2727    

--- systemctl status user@2727:
● user - User Manager for UID 2727
   Loaded: loaded (/usr/lib/systemd/system/user@.service; static; vendor preset: disabled)
   Active: failed (Result: protocol) since Wed 2019-05-08 18:56:18 EDT; 16h ago
     Docs: man:user@.service(5)
  Process: 3995 ExecStart=/usr/lib/systemd/systemd --user (code=exited, status=224/PAM)
 Main PID: 3995 (code=exited, status=224/PAM)

May 08 18:56:04 brainy.arlut.utexas.edu systemd[1]: Starting User Manager for UID 2727...
May 08 18:56:17 brainy.arlut.utexas.edu systemd[3995]: pam_unix(systemd-user:session): session opened for user XXXX by (uid=0)
May 08 18:56:17 brainy.arlut.utexas.edu systemd[3995]: PAM failed: Cannot allocate memory
May 08 18:56:17 brainy.arlut.utexas.edu systemd[3995]: pam_unix(systemd-user:session): session closed for user XXXX
May 08 18:56:17 brainy.arlut.utexas.edu systemd[3995]: user: Failed to set up PAM session: Cannot allocate memory
May 08 18:56:17 brainy.arlut.utexas.edu systemd[3995]: user: Failed at step PAM spawning /usr/lib/systemd/systemd: Cannot allocate memory
May 08 18:56:18 brainy.arlut.utexas.edu systemd[1]: user: Failed with result 'protocol'.
May 08 18:56:18 brainy.arlut.utexas.edu systemd[1]: Failed to start User Manager for UID 2727.

--- systemctl --user
Failed to list units: Process org.freedesktop.systemd1 exited with status 1

--- systemctl --user status
Failed to read server status: Process org.freedesktop.systemd1 exited with status 1

Comment 7 m 2019-05-09 15:34:38 UTC

One follow up comment... NIS is being used in conjunction with ypbind.

Comment 8 Leif Hedstrom 2019-05-09 15:51:09 UTC

Yeh, I’m using NIS as well.

Comment 9 Leif Hedstrom 2019-05-20 17:02:12 UTC

So, should we move this to the NIS/YP component? I tested some more, it's definitely looking to be NIS related, I see e.g.

#0  0x00007f95387eaf58 in pthread_cond_init@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f9536e9dd06 in clnt_dg_create () from /lib64/libtirpc.so.3
#2  0x00007f9536e9e082 in clnt_tli_create () from /lib64/libtirpc.so.3
#3  0x00007f9536ea50a3 in ?? () from /lib64/libtirpc.so.3
#4  0x00007f9536ea5c7e in ?? () from /lib64/libtirpc.so.3
#5  0x00007f9536e9e255 in clnt_tp_create_timed () from /lib64/libtirpc.so.3
#6  0x00007f9536e9e409 in clnt_create_timed () from /lib64/libtirpc.so.3
#7  0x00007f9536ecbf5b in ?? () from /lib64/libnsl.so.2
#8  0x00007f9536ecc68b in ?? () from /lib64/libnsl.so.2
#9  0x00007f9536eccd32 in yp_all () from /lib64/libnsl.so.2
#10 0x00007f9536ee62ac in _nss_nis_initgroups_dyn () from /lib64/libnss_nis.so.2
#11 0x00007f9538f17c06 in internal_getgrouplist () from /lib64/libc.so.6
#12 0x00007f9538f17f5f in initgroups () from /lib64/libc.so.6
#13 0x000055c09c1715aa in ?? ()
#14 0x000055c09c1f5b85 in ?? ()
#15 0x000055c09c1c7204 in ?? ()
#16 0x000055c09c1c9fff in ?? ()
#17 0x000055c09c1ca440 in ?? ()
#18 0x000055c09c1f0390 in ?? ()
#19 0x000055c09c1d0d0a in ?? ()
#20 0x00007f9538c2ba52 in ?? () from /usr/lib/systemd/libsystemd-shared-241.so
#21 0x00007f9538c2bebd in sd_event_dispatch () from /usr/lib/systemd/libsystemd-shared-241.so
#22 0x00007f9538c2c048 in sd_event_run () from /usr/lib/systemd/libsystemd-shared-241.so
#23 0x000055c09c212698 in ?? ()
#24 0x000055c09c16749d in ?? ()
#25 0x00007f9538e77f33 in __libc_start_main () from /lib64/libc.so.6
#26 0x000055c09c16855e in ?? ()


The process, distccd, ends up dying after a while.
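
The backtrace above reaches the NIS code through glibc's group-list lookup (internal_getgrouplist). As a rough way to exercise that kind of lookup from an ordinary process, here is a minimal Python sketch. `os.getgrouplist` wraps getgrouplist(3), which consults the `group` sources configured in /etc/nsswitch.conf. The helper name `timed_grouplist` and the choice of the root user are illustrative assumptions, and whether this shows the slowdown or memory growth depends on the NIS setup and on the limits the calling process inherits.

```python
import os
import time


def timed_grouplist(user, gid):
    """Return (groups, seconds) for a single getgrouplist(3) lookup."""
    start = time.monotonic()
    groups = os.getgrouplist(user, gid)
    return groups, time.monotonic() - start


if __name__ == "__main__":
    # "root" with gid 0 is only an example; substitute a NIS user to test NIS.
    groups, seconds = timed_grouplist("root", 0)
    print(f"{len(groups)} groups in {seconds:.3f}s")
```

A lookup that normally returns in milliseconds but takes many seconds here would point at the same code path as the trace.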

Comment 10 m 2019-05-20 18:47:56 UTC

Update to the situation...  Problem disappeared after moving hard drives to newer system.

Coworker mentioned that he had seen some mysterious hardware-specific bugs in the past, and that it could be that.

There was a brand new desktop in my office, so I literally transplanted the hard drives to the new machine and the problem has vanished.

Initial, problematic computer:
Intel 6700K - 64 GB RAM

New Computer:
Intel 9900K - 16 GB RAM

*big shrug*

Comment 11 Leif Hedstrom 2019-05-20 23:48:31 UTC

I have the same problem on 5 different systems, two different sets of hardware.

Comment 12 m 2019-05-21 20:51:02 UTC

And... it's back.  I moved the RAM from the 6700K system to the 9900K system and the bug is back.

Odd that it doesn't seem to trigger when there's not a lot of RAM.

Comment 13 Bernard Piette 2019-06-03 13:35:11 UTC

I have just reported a similar bug : https://bugzilla.redhat.com/show_bug.cgi?id=1716382

For me it is clearly nis based. When I remove nis from nsswhich.conf, all works fine (except that I can only log in as root). I am not sure if the problem is in nss_nis, systemd or even pam.

Comment 14 m 2019-06-03 16:13:59 UTC

I ended up risking it and installing Fedora 30 on a new workstation (2x Intel Xeon Gold 6154 with 384GB of RAM).  The problem remains, but there is a limit to the RAM growth.  Eventually things stabilized with (sd-pam) consuming about 106G of RAM and not crashing.  systemd would still routinely pop up on new login and consume a ton of RAM and then vanish.  

Unfortunately, I don't have a lot of time to debug this and had no choice but to do a clean install of Fedora 29 on this machine.  This issue is not at all present on Fedora 29.

Comment 15 m 2019-06-03 16:20:49 UTC

Interesting... Just ran dnf update and saw some nss updates.  The original machine that was having this problem now seems to be behaving.  I will reboot at some point to truly confirm, but may not be able to for a few days.

Comment 16 Leif Hedstrom 2019-06-03 21:34:40 UTC

dnf update did not resolve the issues on my hosts (just ran it, and rebooted).

Comment 17 Bernard Piette 2019-06-04 12:44:40 UTC

The main problem is that systemd creates subprocesses that use such a huge amount of memory that the computer becomes unusable.
To try to solve that problem I tried to restrict the memory used by systemd by adding the following lines

DefaultLimitDATA=4G
DefaultLimitSTACK=8M

to /etc/systemd/system.conf, but this did not make much of a difference.

I have also tried to use at the same time

DefaultLimitAS=4G

but with no difference.

I have actually tried a range of values, but none of them helped in any way. If set too small, then the system fails to boot properly.

Is it normal that systemd ignores those values, or at least seems to? Is there any other way to restrict the memory of systemd?

Comment 18 Bernard Piette 2019-06-06 07:03:01 UTC

Having done more tests, I can now confirm that the problem occurs each time a daemon that needs to consult a nis table is started. By reducing the number of nis entries in nsswhich.conf, one reduces the number of instances when systemd generates huge processes.

For example, if hosts has a nis entry in nsswhich.conf, resolved uses 52.2 GB of RAM, but if nis is removed, then it works fine.
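
To see which name-service databases still consult NIS, the configuration file can be scanned for lines that list `nis` as a source. The following is a minimal Python sketch; the function name `databases_using` is an illustrative assumption, and it only does simple whitespace tokenizing of each line.

```python
def databases_using(source, text):
    """Return the nsswitch databases whose source list mentions `source`."""
    found = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        database, sources = line.split(":", 1)
        if source in sources.split():
            found.append(database.strip())
    return found


if __name__ == "__main__":
    try:
        with open("/etc/nsswitch.conf") as handle:
            print(databases_using("nis", handle.read()))
    except OSError as error:
        print("could not read /etc/nsswitch.conf:", error)
```

Each database in the output is one more lookup type that can reach the NIS client code when a daemon starts.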

Comment 19 RobbieTheK 2019-07-09 17:36:19 UTC

I posted a similar sounding issue at https://bugzilla.redhat.com/show_bug.cgi?id=1721574. Also note your typo: /etc/nsswitch.conf

Comment 20 Zbigniew Jędrzejewski-Szmek 2019-08-05 20:56:21 UTC

> I can now confirm that the problem occurs each time a daemon that needs to consult a nis table is started.

OK, I have no idea what is going on here, but since there are multiple confirmations that this is related
to NIS, I'll reassign. I think that if this was a problem in systemd, we'd see many more reports.

Comment 21 Zbigniew Jędrzejewski-Szmek 2019-08-05 20:59:32 UTC

*** Bug 1714117 has been marked as a duplicate of this bug. ***

Comment 22 Zbigniew Jędrzejewski-Szmek 2019-08-05 21:06:22 UTC

*** Bug 1724086 has been marked as a duplicate of this bug. ***

Comment 23 Zbigniew Jędrzejewski-Szmek 2019-08-21 15:05:09 UTC

See also the discussion in https://github.com/systemd/systemd/pull/13359.

Comment 24 Grant Gray 2019-09-03 04:20:38 UTC

As a workaround, consider overriding the file descriptor limit for affected services, such as ypbind, in the systemd unit file.

e.g:

[Service]
...
LimitNOFILE=1024


Works for me.
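
To check whether such an override actually reached a running service, one option is to read the "Max open files" row from /proc/<pid>/limits for the service's main PID (for example, the PID reported by `systemctl show -p MainPID ypbind`). This is a minimal Python sketch; the helper name `open_file_limits` is an illustrative assumption, and the demo reads the current process's limits rather than a service's.

```python
def open_file_limits(limits_text):
    """Extract (soft, hard) 'Max open files' values from /proc/<pid>/limits text."""
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return soft, hard  # strings, since a value may be "unlimited"
    return None


if __name__ == "__main__":
    try:
        with open("/proc/self/limits") as handle:
            print(open_file_limits(handle.read()))
    except OSError as error:
        print("could not read /proc/self/limits:", error)
```

If the hard value is still far above 1024 for the affected process, the LimitNOFILE line is probably not being applied to that unit.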

Comment 25 Zbigniew Jędrzejewski-Szmek 2019-10-23 19:56:17 UTC

*** Bug 1764811 has been marked as a duplicate of this bug. ***

Comment 26 Zbigniew Jędrzejewski-Szmek 2019-10-24 19:22:26 UTC

*** Bug 1746844 has been marked as a duplicate of this bug. ***

Comment 27 Johannes Niediek 2019-10-30 13:33:25 UTC

Could you specify how you find out where exactly to put this line?

For me, putting it into /usr/lib/systemd/system/ypbind.service does not solve the problem at all.

My machine is practically unusable because of this issue.

Comment 28 Johannes Niediek 2019-11-04 13:23:51 UTC

Does anybody know if this problem is fixed in Fedora 31? The following bugs all seem to stem from the same underlying cause, and they are all open:

https://bugzilla.redhat.com/show_bug.cgi?id=1705641
https://bugzilla.redhat.com/show_bug.cgi?id=1716382
https://bugzilla.redhat.com/show_bug.cgi?id=1724423


Thank you.

Comment 29 Francis Montagnac 2019-11-04 14:15:16 UTC

> Does anybody know if this problem is fixed in Fedora 31?

I just checked: this problem is *NOT* fixed. See traces below.

As I said in this other bug:

  https://bugzilla.redhat.com/show_bug.cgi?id=1714117

the problem lies in initgroups when nis is in the group line in
/etc/nsswitch.conf

### Attempt to connect with ssh as a normal user: really long

    time ssh local37 echo OK
    OK

    real    0m26.366s
    user    0m0.083s
    sys     0m0.020s
    $ fm@kermit 2019-11-04 15:05:55 ~

### Logs on the impacted machine

    uname -r
    5.3.7-301.fc31.x86_64

    ### Huge amount of memory used by systemd
    lps -a +etime-2m -w
    USER         PID    PPID     VSZ     RSS    TTY ELAPSED  CPUTIME %CPU S COMMAND
				  Mb      Mb         d.hhmm   h.mmss        
    root      131531       1 54699.9 30999.0      ?  0.0000   0.0027 98.4 R (systemd)
    root      131521       2     0.0     0.0      ?  0.0000   0.0000  0.0 I [kworker/0:3-cgroup_destroy]
    root      131520       2     0.0     0.0      ?  0.0000   0.0000  0.0 I [kworker/0:1-events]
    root      131519       2     0.0     0.0      ?  0.0000   0.0000  0.0 I [kworker/3:0-events]
    root      131518       2     0.0     0.0      ?  0.0000   0.0000  0.0 I [kworker/2:1-events_power_efficient]
    # root@local37 2019-11-04 15:05:58 ~

    journalctl --since -2m -p 0..3
    -- Logs begin at Sun 2019-11-03 20:00:09 CET, end at Mon 2019-11-04 15:06:48 CET. --
    Nov 04 15:05:55 local37 sshd[131528]: pam_systemd(sshd:session): Failed to create session: Connection timed out
    # root@local37 2019-11-04 15:06:49 ~

    ### The kernel finishes by killing this systemd process
    Nov 04 15:06:48 local37 kernel: (systemd) invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
    Nov 04 15:06:48 local37 kernel: CPU: 2 PID: 131531 Comm: (systemd) Not tainted 5.3.7-301.fc31.x86_64 #1
    <snip>

Comment 30 Leif Hedstrom 2019-11-04 14:24:47 UTC

I gave up a while ago, and turned off NIS. Systemd is so broken, it’s not worth trying to use or fight it... so, I have no way of testing this either, but seems it’s still broken.

Comment 32 RobbieTheK 2019-11-06 19:20:35 UTC

Are any of the NIS developers going to look at this? I see kswapd is using memory, pretty sure it's related. Any other debug info I can provide?

top - 14:03:17 up 4 days, 16:53,  1 user,  load average: 6.35, 2.99, 1.73
Tasks: 259 total,   2 running, 257 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.6 us,  7.2 sy,  0.0 ni, 77.1 id, 14.1 wa,  0.3 hi,  0.6 si,  0.0 st
MiB Mem :  63879.6 total,    243.3 free,  63535.1 used,    101.3 buff/cache
MiB Swap:  22892.0 total,  12119.4 free,  10772.6 used.    168.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 138311 root      20   0  104.2g  61.4g    216 R  35.2  98.4   1:31.41 (systemd)
    205 root      20   0       0      0      0 D  28.7   0.0   5:02.77 kswapd1
    204 root      20   0       0      0      0 D  24.9   0.0   4:57.25 kswapd0

Comment 33 RobbieTheK 2019-12-17 14:30:50 UTC

See https://bugzilla.redhat.com/show_bug.cgi?id=1716382#c24, try enabling NSCD and kindly respond here to see if it helps.

Comment 34 Francis Montagnac 2019-12-17 15:53:47 UTC

I confirm: enabling nscd (with enable-cache group yes) makes this
problem disappear.

IMHO:
  - this is only a workaround
  - we should forget NIS ... after so many years of good services

Comment 35 Christopher Heiny 2019-12-17 17:20:34 UTC

(In reply to RobbieTheK from comment #33)
> See https://bugzilla.redhat.com/show_bug.cgi?id=1716382#c24, try enabling
> NSCD and kindly respond here to see if it helps.

Thanks Robbie! Just to clarify before I go experimenting: we want to stop/disable ypbind and enable/start nscd. Is that correct? Or merely enable/start nscd?

Thanks again,
Chris

Comment 36 RobbieTheK 2019-12-17 17:35:32 UTC

(In reply to Christopher Heiny from comment #35)
> (In reply to RobbieTheK from comment #33)
> > See https://bugzilla.redhat.com/show_bug.cgi?id=1716382#c24, try enabling
> > NSCD and kindly respond here to see if it helps.
> 
> Thanks Robbie! Just to clarify before I go experimenting: we want to
> stop/disable ypbind and enable/start nscd. Is that correct? Or merely
> enable/start nscd?

The latter: enable/start nscd, leave ypbind running.

Comment 37 Christopher Heiny 2020-02-11 23:16:06 UTC

(In reply to RobbieTheK from comment #36)
> (In reply to Christopher Heiny from comment #35)
> > (In reply to RobbieTheK from comment #33)
> > > See https://bugzilla.redhat.com/show_bug.cgi?id=1716382#c24, try enabling
> > > NSCD and kindly respond here to see if it helps.
> > 
> > Thanks Robbie! Just to clarify before I go experimenting: we want to
> > stop/disable ypbind and enable/start nscd. Is that correct? Or merely
> > enable/start nscd?
> 
> The latter: enable/start nscd, leave ypbind running.

Sorry for the long delay in replying - I got yanked to a different project, and am now just getting back to the one affected by this.

Enabling nscd has rendered our F30 configuration useable, so we can push it to production.  Hooray and thank you!

Chris

Comment 38 Filip Januš 2020-04-06 16:50:32 UTC

Hi,
after spending hours debugging, I think I have found a problem between systemd-240 and libtirpc. I have prepared a patched package [1] and now need to test it, but I am not able to reproduce this bug. Is there someone here who would be able to test my patch and check whether the bug persists?

One caveat: this isn't the final fix, even if it works.

Thanks ! 



[1] https://fjanus.fedorapeople.org/libtirpc-1.2.5-2.rc2.fc33.x86_64.rpm

Comment 39 Francis Montagnac 2020-04-07 16:46:35 UTC

It works with libtirpc-1.2.5-2.rc2.fc33.x86_64.rpm

From my test VM with:

  mem: 64G (for this test)
  kernel-5.3.11-300.fc31.x86_64
  systemd-243.4-1.fc31.x86_64
  nsswitch.conf: group:      files nis systemd
  nscd disabled

I checked first that the problem was still there before upgrading
libtirpc.

I'm curious to know (in brief) the cause of this bug. I spent some
time tracing it and thought it was in initgroups.

Thanks.

Comment 40 Filip Januš 2020-04-07 18:24:34 UTC

Thanks a lot for your cooperation. 
Since systemd-240, the hard limit on the number of file descriptors has been increased to 512K; see the systemd changelog [1]. libtirpc uses this limit to calculate how much memory to allocate, and for some other operations.



[1] https://github.com/systemd/systemd/blob/master/NEWS
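
The mechanism described in this comment, memory sized by the descriptor limit rather than by the descriptors actually in use, can be illustrated with a rough Python sketch. The helper `table_bytes` and the 48-byte entry size are assumptions chosen for illustration, not values taken from libtirpc's source.

```python
import resource


def table_bytes(descriptor_limit, bytes_per_entry):
    """Size of a table holding one fixed-size entry per possible descriptor."""
    return descriptor_limit * bytes_per_entry


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)
    if hard != resource.RLIM_INFINITY:
        # 48 bytes per entry is an assumed, illustrative size.
        print(table_bytes(hard, 48), "bytes for a table sized by the hard limit")
```

The allocation grows linearly with the limit, so raising the limit raises memory use even when only a handful of descriptors are open. That is consistent with the LimitNOFILE workaround mentioned earlier in this thread.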

Comment 41 Leif Hedstrom 2020-04-07 20:30:10 UTC

That's some incredible detective work! And, a really sad story how systemd can put an entire subsystem / eco system in complete disarray with one small change...

Comment 42 Filip Januš 2020-04-20 16:10:29 UTC

Because this bug comes from libtirpc not from NIS, I will assign it to libtirpc.

Comment 43 Filip Januš 2020-04-22 10:50:51 UTC

*** Bug 1716382 has been marked as a duplicate of this bug. ***

Comment 44 Ben Cotton 2020-04-30 20:12:15 UTC

This message is a reminder that Fedora 30 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 30 on 2020-05-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '30'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 30 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 45 Ben Cotton 2020-05-26 18:42:16 UTC

Fedora 30 changed to end-of-life (EOL) status on 2020-05-26. Fedora 30 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
