Bug 2125705 - Spausedd consumes all system memory
Summary: Spausedd consumes all system memory
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: corosync
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: cluster-qe
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-09-09 19:11 UTC by Joshua Baker
Modified: 2023-08-10 15:39 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker KCSOPP-2112 0 None None None 2022-09-09 19:15:06 UTC
Red Hat Issue Tracker RHELPLAN-133699 0 None None None 2022-09-09 19:29:28 UTC
Red Hat Knowledge Base (Solution) 7000100 0 None None None 2023-02-28 15:40:22 UTC

Description Joshua Baker 2022-09-09 19:11:21 UTC
Description of problem:

From "ps" output:
~~~
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
                           ~11.89 Gb VV
root      1416  0.6 76.6 12529176 12475996 ?   SLs  Jan03 2212:18 /usr/bin/spausedd -D
~~~

From /proc/<pid>/maps for spausedd:
~~~
$ gawk --non-decimal-data '{split($1,a,"-"); $7=sprintf("0x%s",a[1] ); $8=sprintf("0x%s",a[2]); gsub(/ /,"",$7); gsub(/ /,"",$8);  printf "%8d kb %4s %8s %5s %-6s           %7s\n", ( $8 - $7 ) / 1024,$2,$3,$4,$5,$6}' spausedd-maps.txt 
      16 kb r-xp 00000000 fd:00 8502902           /usr/bin/spausedd
       4 kb r--p 00003000 fd:00 8502902           /usr/bin/spausedd
       4 kb rw-p 00004000 fd:00 8502902           /usr/bin/spausedd
     132 kb rw-p 00000000 00:00 0                 [heap]
12300984 kb rw-p 00000000 00:00 0                 [heap]   <--- ~11.7 Gb of heap
~~~
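
For reference, the same per-mapping sizes can be computed without gawk. This is only a minimal Python sketch, assuming the usual /proc/<pid>/maps line layout (start-end, perms, offset, dev, inode, optional path). The function names and the sample addresses are made up for illustration and are not taken from the customer's data:

```python
def mapping_sizes_kib(maps_text):
    """Return (size_kib, perms, path) for each line of /proc/<pid>/maps text."""
    rows = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_hex, end_hex = fields[0].split("-")
        size_kib = (int(end_hex, 16) - int(start_hex, 16)) // 1024
        path = fields[5].strip() if len(fields) > 5 else ""
        rows.append((size_kib, fields[1], path))
    return rows


def kib_to_gib(kib):
    """Convert a size in KiB (as printed by ps and above) to GiB."""
    return kib / (1024 * 1024)


# Hypothetical sample lines (addresses invented for illustration only).
SAMPLE = """\
00400000-00404000 r-xp 00000000 fd:00 8502902 /usr/bin/spausedd
01000000-01021000 rw-p 00000000 00:00 0 [heap]
"""

if __name__ == "__main__":
    for size_kib, perms, path in mapping_sizes_kib(SAMPLE):
        print(f"{size_kib:8d} kB {perms} {path}")
    # Size of the large mapping reported above, converted to GiB.
    print(f"{kib_to_gib(12300984):.2f} GiB")
```

On a live system the input would be the content of /proc/<pid>/maps (or the saved spausedd-maps.txt) instead of SAMPLE.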

It is unclear at this time how to reproduce this issue, or under what circumstances this issue occurs. 


Version-Release number of selected component (if applicable):

- kernel-3.10.0-957.54.1.el7.x86_64
- spausedd-2.4.5-7.el7_9.1.x86_64

How reproducible:
Not Known


Additional info:
This seems like a possible memory leak so valgrind could be helpful here. Looking for further recommendations for data collection or next steps.

Comment 4 Jan Friesse 2022-09-12 12:55:05 UTC
Hi Joshua,
thank you for the report. It looks scary. As Reid mentioned, there is no direct malloc in spausedd during runtime, so I would guess it's probably a problem in VMGuestLib (or maybe spausedd is not calling some free function?). Is spausedd running on a VMware VM (the log should contain "Using VMGuestLib")?

Honestly, without a reproducer I don't see any way to move forward. Valgrind could help identify the problem, ideally once we find a reproducer. Still, it's worth a try to run:

```
valgrind --leak-check=full --show-leak-kinds=all spausedd -f
```

and after some time
```
kill -ABRT `pidof valgrind`
```

If there are leaks, it should show them quickly.

Comment 6 Jan Friesse 2022-09-20 08:06:34 UTC
Joshua:
weird that you weren't able to find the log message (it should be there), but I expect it is using vmguestlib anyway. Let me know the result. Sadly I don't have any VMware environment to test it myself :(

Comment 7 Reid Wahl 2022-09-20 08:22:33 UTC
spausedd had been running since Jan 3 and the logs only go back to July, so it's not too surprising that there's no "Using VMGuestLib" message.

Comment 8 Jan Friesse 2022-09-20 14:27:03 UTC
Thanks to Reid I've got access to a VMware host and created a VM with RHEL 7. The problem is that valgrind is unable to recognize an instruction used by VMGuestLib, so the result is:
==2558== valgrind: Unrecognised instruction at address 0x58878af.
==2558==    at 0x58878AF: Hostinfo_TouchXen (hostinfoHV.c:255)
==2558==    by 0x588284D: VmCheckSafe (vmcheck.c:169)
==2558==    by 0x58829DA: VmCheck_IsVirtualWorld (vmcheck.c:299)
==2558==    by 0x5041088: VMGuestLib_OpenHandle (vmGuestLib.c:262)
==2558==    by 0x40163D: guestlib_init (spausedd.c:542)
==2558==    by 0x40163D: main (spausedd.c:787)
==2558== Your program just tried to execute an instruction that Valgrind
==2558== did not recognise.  There are two possible reasons for this.
==2558== 1. Your program has a bug and erroneously jumped to a non-code
==2558==    location.  If you are running Memcheck and you just saw a
==2558==    warning about a bad jump, it's probably your program's fault.
==2558== 2. The instruction is legitimate but Valgrind doesn't handle it,
==2558==    i.e. it's Valgrind's fault.  If you think this is the case or
==2558==    you are not sure, please let us know and we'll try to fix it.
==2558== Either way, Valgrind will now raise a SIGILL signal which will
==2558== probably kill your program.
Sep 20 16:08:20 spausedd: Can't open guestlib handle: VMware Guest API is not running in a Virtual Machine

Because of this, spausedd running under valgrind behaves as if vmguestlib were not available :( So, to conclude, it doesn't make any sense to ask the customer to use valgrind.

Anyway, I've at least tried to reproduce the problem as fast as possible by setting the timeout to 1 (spausedd -D -t 1), and after 1 hour the RSS was 4776 kB. I will let it run for a few days and see what happens, but right now I don't have any solution or idea how to move forward.
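
To make such a long-running observation less manual, the RSS could be sampled periodically. Below is a rough Python sketch, not something that was actually used here. It assumes the VmRSS field of /proc/<pid>/status (reported in kB), and all function names are hypothetical:

```python
import time


def read_rss_kib(status_text):
    """Extract VmRSS (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # no VmRSS line found (e.g. kernel thread)


def sample_rss(pid, count, interval_s):
    """Return a list of (unix_time, rss_kib) samples for one process."""
    samples = []
    for i in range(count):
        with open(f"/proc/{pid}/status") as f:
            samples.append((time.time(), read_rss_kib(f.read())))
        if i + 1 < count:
            time.sleep(interval_s)
    return samples


def growth_kib_per_hour(samples):
    """Average RSS growth between the first and the last sample."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (r1 - r0) / ((t1 - t0) / 3600.0)
```

For example, growth_kib_per_hour(sample_rss(pid, 24, 3600)) would give the average growth over roughly a day. A value that stays near zero matches the flat RSS seen in this test.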

Any chance the customer can try a newer RHEL so we at least move to better supported territory (ideally RHEL 9)?

Comment 9 Jan Friesse 2022-09-20 14:39:31 UTC
Found something (maybe) interesting https://github.com/vmware/open-vm-tools/issues/514#issuecomment-839107895 - we have 11.0.5.

Have you seen a similarly large RSS for vmtoolsd? If so, we could try the workaround described there. I'm not really sure it helps spausedd, but it's worth a try.

Comment 10 Reid Wahl 2022-09-21 07:44:43 UTC
(In reply to Jan Friesse from comment #9)
> Found something (maybe) interesting
> https://github.com/vmware/open-vm-tools/issues/514#issuecomment-839107895 -
> we have 11.0.5.
> 
> Have you seen vmtoolsd being similarly big rss for vmtoolsd? If so, we may
> give a try to workaround option described - not really sure if it helps
> spausedd, but it's worth a try?

vmtools seems to be doing okay.

root      1393  0.0  0.0 396576  5676 ?        Ssl  Jan03 181:31 /usr/bin/vmtoolsd


When I was first looking at this, I found several memory leak fixes in the change log for vmguestlib. None that I could really point to for this issue, but it lends further support to the notion that the problem may reside there. It could be a leak that's been fixed in some newer version, or a leak that hasn't even been identified yet.

Comment 11 Jan Friesse 2022-09-21 14:30:22 UTC
:(

Anyway, I think it is not a problem in spausedd itself, because after a day of running I see
root      3156 99.9  0.1  23184  4776 ?        RLs  Sep20 1449:45 spausedd -D -t 1

(this is a pristine RHEL 7.9 without any config). So I think it is some combination of vmguestlib and the configuration of the VMware VM that may cause the memory leak.

And honestly, I really have NO clue how to move forward.

Comment 12 Reid Wahl 2022-09-21 20:13:41 UTC
Well, a workaround would be to restart spausedd I presume. Someone could take this to VMware, but I don't expect that to go far since we can't reproduce it and it's technically our program that shows the leak.
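
If someone wants to automate that workaround, a threshold check along these lines might do. This is only a sketch: the unit name spausedd.service, the limit value, and the function names are assumptions, and it defaults to a dry run that only returns the command it would execute:

```python
import subprocess


def needs_restart(rss_kib, limit_kib):
    """True when a known RSS value (in kB) is above the chosen limit."""
    return rss_kib is not None and rss_kib > limit_kib


def restart_if_bloated(rss_kib, limit_kib, unit="spausedd.service", dry_run=True):
    """Restart the unit when RSS exceeds the limit.

    Returns the systemctl command (a list) when a restart is warranted,
    otherwise None. The unit name is an assumption, not verified here.
    """
    if not needs_restart(rss_kib, limit_kib):
        return None
    cmd = ["systemctl", "restart", unit]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

The RSS value can come from the ps RSS column or from /proc/<pid>/status. With a limit of 1048576 kB (1 GiB, picked arbitrarily), the 12475996 kB from the description would trigger a restart, while a normal value such as 4776 kB would not.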

Comment 13 Jan Friesse 2022-09-22 07:30:51 UTC
Reid: Agreed. We cannot file an issue with VMware/open-vm-tools upstream without a reproducer and confirmation using the upstream guestlib.

Comment 14 Jan Friesse 2022-09-26 07:46:01 UTC
Checked again this morning and the RSS is still 4776 kB. So I've restarted it and now I'm testing the default timeout.

I'm wondering, is the customer using (for example) VM migration or some other "special" features?

