Description of problem:

From "ps" output:
~~~
USER       PID %CPU %MEM      VSZ      RSS TTY      STAT START    TIME COMMAND
                                ~11.89 Gb
                                       VV
root      1416  0.6 76.6 12529176 12475996 ?        SLs  Jan03 2212:18 /usr/bin/spausedd -D
~~~

From /proc/<pid>/maps for spausedd:
~~~
$ gawk --non-decimal-data '{split($1,a,"-"); $7=sprintf("0x%s",a[1] ); $8=sprintf("0x%s",a[2]); gsub(/ /,"",$7); gsub(/ /,"",$8); printf "%8d kb %4s %8s %5s %-6s %7s\n", ( $8 - $7 ) / 1024,$2,$3,$4,$5,$6}' spausedd-maps.txt
      16 kb r-xp 00000000 fd:00 8502902 /usr/bin/spausedd
       4 kb r--p 00003000 fd:00 8502902 /usr/bin/spausedd
       4 kb rw-p 00004000 fd:00 8502902 /usr/bin/spausedd
     132 kb rw-p 00000000 00:00 0       [heap]
12300984 kb rw-p 00000000 00:00 0       [heap]   <--- ~11.7 Gb of heap
~~~

It is unclear at this time how to reproduce this issue, or under what circumstances it occurs.

Version-Release number of selected component (if applicable):
- kernel-3.10.0-957.54.1.el7.x86_64
- spausedd-2.4.5-7.el7_9.1.x86_64

How reproducible: Not Known

Additional info: This seems like a possible memory leak, so valgrind could be helpful here. Looking for further recommendations for data collection or next steps.
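For anyone who prefers to repeat the size calculation without gawk, here is a minimal Python sketch of the same idea: split the address range of each `/proc/<pid>/maps` line, subtract, and report the size in KiB. The function names `mapping_sizes` and `kib_to_gib` are made up for illustration, and the addresses in `SAMPLE` are invented (the original maps file was not attached with addresses), so treat it as a sketch rather than the exact procedure used above.

```python
def mapping_sizes(maps_text):
    """Return (size_kib, perms, offset, dev, inode, path) per maps line."""
    rows = []
    for line in maps_text.splitlines():
        # At most 6 fields; the last one (path) may be missing or contain spaces.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(part, 16) for part in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        size_kib = (end - start) // 1024
        rows.append((size_kib, fields[1], fields[2], fields[3], fields[4], path))
    return rows


def kib_to_gib(kib):
    """Convert a KiB count (as reported by ps or computed above) to GiB."""
    return kib / (1024 * 1024)


# Hypothetical input: these addresses are invented for the example.
SAMPLE = """\
00400000-00404000 r-xp 00000000 fd:00 8502902 /usr/bin/spausedd
00603000-00604000 r--p 00003000 fd:00 8502902 /usr/bin/spausedd
"""

if __name__ == "__main__":
    # Largest mappings first, in roughly the same column layout as the gawk output.
    for row in sorted(mapping_sizes(SAMPLE), reverse=True):
        print("%8d kb %4s %8s %5s %-6s %s" % row)
    # The large heap mapping reported above, expressed in GiB.
    print("%.2f GiB" % kib_to_gib(12300984))
```

Sorting by size puts a runaway mapping like the large `[heap]` entry at the top, which makes it easier to spot in a long maps file.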
Hi Joshua,

thank you for the report. It looks scary. As Reid mentioned, there is no direct malloc in spausedd during runtime, so I would guess the problem is probably in VMGuestLib (or maybe spausedd is not calling some free function?). Is spausedd running on a VMware VM (the log should contain "Using VMGuestLib")?

Honestly, without a reproducer I don't see any way to move forward. Valgrind could help identify the problem, ideally once we find a reproducer. It's still worth a try to run:
```
valgrind --leak-check=full --show-leak-kinds=all spausedd -f
```
and after some time
```
kill -ABRT `pidof valgrind`
```
If there are leaks, it should show them quickly.
Joshua: weird that you weren't able to find the log message (it should be there), but I expect it is using vmguestlib anyway. Let me know the result. Sadly I don't have any VMware host to test it myself :(
spausedd had been running since Jan 3 and the logs only go back to July, so it's not too surprising that there's no "Using VMGuestLib" message.
Thanks to Reid I've got access to a VMware host and created a VM with RHEL 7. The problem is that valgrind is unable to recognize an instruction used by VMGuestLib, so the result is:

==2558== valgrind: Unrecognised instruction at address 0x58878af.
==2558==    at 0x58878AF: Hostinfo_TouchXen (hostinfoHV.c:255)
==2558==    by 0x588284D: VmCheckSafe (vmcheck.c:169)
==2558==    by 0x58829DA: VmCheck_IsVirtualWorld (vmcheck.c:299)
==2558==    by 0x5041088: VMGuestLib_OpenHandle (vmGuestLib.c:262)
==2558==    by 0x40163D: guestlib_init (spausedd.c:542)
==2558==    by 0x40163D: main (spausedd.c:787)
==2558== Your program just tried to execute an instruction that Valgrind
==2558== did not recognise. There are two possible reasons for this.
==2558== 1. Your program has a bug and erroneously jumped to a non-code
==2558==    location. If you are running Memcheck and you just saw a
==2558==    warning about a bad jump, it's probably your program's fault.
==2558== 2. The instruction is legitimate but Valgrind doesn't handle it,
==2558==    i.e. it's Valgrind's fault. If you think this is the case or
==2558==    you are not sure, please let us know and we'll try to fix it.
==2558== Either way, Valgrind will now raise a SIGILL signal which will
==2558== probably kill your program.
Sep 20 16:08:20 spausedd: Can't open guestlib handle: VMware Guest API is not running in a Virtual Machine

Because of this, spausedd running under valgrind behaves as if it had no vmguestlib :( So to conclude, it doesn't make any sense to ask the customer to use valgrind.

Anyway, I've at least tried to reproduce the problem as fast as possible by setting the timeout to 1 (spausedd -D -t 1), and after 1 hour the RSS was 4776. I will let it run for a few days and see what happens, but right now I don't have any solution or idea how to move forward.

Any chance the customer can try a newer RHEL so we at least move to better supported territory (ideally RHEL 9)?
Found something (maybe) interesting: https://github.com/vmware/open-vm-tools/issues/514#issuecomment-839107895 - we have 11.0.5.

Have you seen a similarly big RSS for vmtoolsd? If so, we could give the workaround option described there a try. I'm not really sure it helps spausedd, but it's worth a try.
(In reply to Jan Friesse from comment #9)
> Found something (maybe) interesting
> https://github.com/vmware/open-vm-tools/issues/514#issuecomment-839107895 -
> we have 11.0.5.
>
> Have you seen vmtoolsd being similarly big rss for vmtoolsd? If so, we may
> give a try to workaround option described - not really sure if it helps
> spausedd, but it's worth a try?

vmtools seems to be doing okay.

root 1393 0.0 0.0 396576 5676 ? Ssl Jan03 181:31 /usr/bin/vmtoolsd

When I was first looking at this, I found several memory leak fixes in the change log for vmguestlib. None that I could really point to for this issue, but it lends further support to the notion that the problem may reside there. It could be a leak that's been fixed in some newer version, or a leak that hasn't even been identified yet.
:( Anyway, I think it is not a problem in spausedd itself, because after a day of running I see

root 3156 99.9 0.1 23184 4776 ? RLs Sep20 1449:45 spausedd -D -t 1

(this is a pristine RHEL 7.9 without any config). So I think it is some combination of vmguestlib and the configuration of the VMware VM that may cause the problem (memory leaks). And honestly, I have really NO clue how to move forward.
Well, a workaround would be to restart spausedd I presume. Someone could take this to VMware, but I don't expect that to go far since we can't reproduce it and it's technically our program that shows the leak.
Reid: Agreed. We cannot file an issue with VMware/open-vm-tools upstream without a reproducer and confirmation using the upstream guestlib.
Checked again this morning and the memory used is still 4776. So I've restarted it and now I'm testing the default timeout. I'm wondering, is the customer using (for example) VM migration or some other "special" features?
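The repeated manual RSS checks in this thread (after an hour, after a day, and again the next morning) could be automated on the affected system to catch the point where growth starts. Below is a minimal Python sketch, assuming a Linux `/proc` filesystem; the helper names `vmrss_kib`, `growth_kib_per_hour` and `sample_rss` are hypothetical, and the slope is a simple first-to-last difference, not a proper leak detector.

```python
import re
import time


def vmrss_kib(status_text):
    """Extract VmRSS (in kB) from the text of /proc/<pid>/status, or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def growth_kib_per_hour(samples):
    """Slope between the first and last (unix_seconds, rss_kib) samples."""
    if len(samples) < 2:
        return 0.0
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (r1 - r0) * 3600.0 / (t1 - t0)


def sample_rss(pid, interval_s=60, count=5):
    """Read VmRSS for `pid` `count` times, sleeping `interval_s` seconds between reads."""
    samples = []
    for i in range(count):
        with open("/proc/%d/status" % pid) as status_file:
            samples.append((time.time(), vmrss_kib(status_file.read())))
        if i + 1 < count:
            time.sleep(interval_s)
    return samples


if __name__ == "__main__":
    import sys

    # Usage (assumed): python rss_watch.py <pid of spausedd>
    collected = sample_rss(int(sys.argv[1]), interval_s=60, count=5)
    for timestamp, rss in collected:
        print(int(timestamp), rss)
    print("growth: %.1f kB/hour" % growth_kib_per_hour(collected))
```

Logging the samples with timestamps would also make it possible to line up any RSS jump with events on the VMware side, such as a VM migration.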