Bug 1717000 - possible memory leak in Gnome - continues [rhel-7.9.z]
Summary: possible memory leak in Gnome - continues [rhel-7.9.z]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: mutter
Version: 7.5
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jonas Ådahl
QA Contact: Desktop QE
URL:
Whiteboard:
Depends On: 1546302 1669048 1753122 1753123 1887862
Blocks: 1719819 1881995 1882009 1882574 1882821 1887741 1888675 1888676 1888678 1888682
 
Reported: 2019-06-04 14:06 UTC by Ray Strode [halfline]
Modified: 2022-01-05 22:52 UTC

Fixed In Version: mutter-3.28.3-28.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1669048
Clones: 1719819 1881217 1881995 1882009 1882574 1882821 1887862 1888682 1899660
Environment:
Last Closed: 2020-11-10 13:17:07 UTC
Target Upstream Version:
Embargoed:


Attachments
debugging patch for glib2 (4.62 KB, text/plain), 2019-09-19 12:48 UTC, Milan Crha


Links
System ID Private Priority Status Summary Last Updated
GNOME Gitlab GNOME/gnome-shell/issues/1415 0 None None None 2019-06-25 20:29:02 UTC
GNOME Gitlab GNOME mutter merge_requests 1225 0 None None None 2020-04-30 14:28:03 UTC
GNOME Gitlab GNOME mutter merge_requests 1449 0 None None None 2020-09-25 18:42:53 UTC
GNOME Gitlab GNOME mutter merge_requests 1451 0 None None None 2020-09-28 13:57:41 UTC

Comment 35 Ray Strode [halfline] 2019-07-17 19:42:52 UTC
So I haven't been able to extract a good trace with that core file...
On my machine I get:

╎❯ gdb /usr/bin/gnome-shell core
...
Reading symbols from /usr/bin/gnome-shell...Reading symbols from /usr/lib/debug/usr/bin/gnome-shell.debug...done.
done.
...
Failed to read a valid object file image from memory.
Core was generated by `/usr/bin/gnome-shell'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f781c36a464 in ?? ()
(gdb) bt
#0  0x00007f781c36a464 in ?? ()
#1  0x0000000001dd0980 in ?? ()
#2  0x00007f7815e2b760 in ?? ()
#3  0x0000000001dd0970 in ?? ()
#4  0x0000000000000080 in ?? ()
#5  0x0000000000000110 in ?? ()
#6  0x0000000000000100 in ?? ()
#7  0x00007f7815aeae42 in ?? ()
#8  0x6f6e000000006968 in ?? ()
#9  0x0000000000000090 in ?? ()
#10 0x0000000000000000 in ?? ()

So gdb is failing to decode the symbols. Also, only frame #0, frame #2, and frame #7 look like they have valid addresses. Interestingly enough, we can see the address of the crash in the core file doesn't match the address in the log:

Jul 17 10:26:37 uiz210 kernel: traps: gnome-shell[2894] general protection ip:7f5c4c4f5464 sp:7fff12a87288 error:0 in libmutter-1.so.0.0.0[7f5c4c468000+155000]

There the instruction pointer is 0x00007f5c4c4f5464 not 0x00007f781c36a464. I guess they were from different runs.  So from the log, we can do some arithmetic and find the offset:

 0x7f5c4c4f5464 - 0x7f5c4c468000 = 0x8d464

and plug that into addr2line:

╎❯ addr2line -e /usr/lib/debug/usr/lib64/libmutter-1.so.debug
0x8d464
/usr/src/debug/mutter-3.26.2/src/compositor/meta-window-actor.c:918

which is here:

gboolean
meta_window_actor_is_destroyed (MetaWindowActor *self)
{
  return self->priv->disposed || self->priv->needs_destroy;
}

So we're calling meta_window_actor_is_destroyed on a freed actor, I guess?

but what about the core file... if we look at it with elfutils we can see the address of libmutter-1:

╎❯ eu-unstrip -n --core core |grep libmutter-1
0x7f781c2dd000+0x365000 6def04fd1bd914f903be05efd2d309b048260412@0x7f781c2dd1d8 . /usr/lib/debug/usr/lib64/libmutter-1.so.0.0.0.debug /usr/lib64/libmutter-1.so.0.0.0

Then do the same subtraction and we see:

0x7f781c36a464 - 0x7f781c2dd000 = 0x8d464

So the offset is the same, 0x8d464: the crash from the core dump is in the same place as the crash from the log, even though it's from a different run.
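
The subtraction above can be scripted. What follows is a minimal sketch, not part of the original analysis: it parses a kernel "traps:" line of the shape quoted from the log and returns the library name and the offset of the instruction pointer inside that mapping. The function name trap_offset and the regular expression are assumptions based only on the single log line shown above; other kernels may format the line differently.

```python
import re

# Tail of a kernel "traps:" line, for example:
#   ip:7f5c4c4f5464 sp:7fff12a87288 error:0 in libmutter-1.so.0.0.0[7f5c4c468000+155000]
# Assumption: this pattern is derived from the one log line quoted in this comment.
TRAP_RE = re.compile(
    r"ip:(?P<ip>[0-9a-f]+) sp:[0-9a-f]+ error:[0-9a-f]+ in "
    r"(?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def trap_offset(line):
    """Return (library, offset) for a traps line; trap_offset is a hypothetical helper."""
    match = TRAP_RE.search(line)
    if match is None:
        raise ValueError("line does not look like a kernel traps message")
    ip = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    size = int(match.group("size"), 16)
    # The instruction pointer should fall inside the reported mapping.
    if not base <= ip < base + size:
        raise ValueError("instruction pointer is outside the reported mapping")
    return match.group("lib"), ip - base


if __name__ == "__main__":
    log_line = (
        "Jul 17 10:26:37 uiz210 kernel: traps: gnome-shell[2894] general protection "
        "ip:7f5c4c4f5464 sp:7fff12a87288 error:0 in "
        "libmutter-1.so.0.0.0[7f5c4c468000+155000]"
    )
    library, offset = trap_offset(log_line)
    print(library, hex(offset))
```

The resulting offset is what gets passed to addr2line together with the matching debuginfo file.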

Alright let's see if we can find out more about the other two valid looking frames, 0x7f7815e2b760 and 0x7f7815aeae42

╎❯ eu-unstrip -n --core core |grep 7f7815
0x7f78151d4000+0x204000 841e238a137d8df48ef7a61cefea4b08914830c4@0x7f78151d41d8 . /usr/lib/debug/usr/lib64/libdl-2.17.so.debug /usr/lib64/libdl-2.17.so
0x7f78153d8000+0x208000 0e311c2b7b04c5c0011a2c07ada4b2866bb0094e@0x7f78153d81d8 - - /usr/lib64/librt-2.17.so
0x7f78155e0000+0x205000 4c7aded703d3bfe82eaddc84b5199c1d50d1fa4a@0x7f78155e01d8 . /usr/lib/debug/usr/lib64/libcap.so.2.22.debug /usr/lib64/libcap.so.2.22
0x7f78157e5000+0x280000 0138c8af92cd91f778226863acb09cabe2bc059b@0x7f78157e51d8 . /usr/lib/debug/usr/lib64/pulseaudio/libpulsecommon-10.0.so.debug /usr/lib64/pulseaudio/libpulsecommon-10.0.so
0x7f7815a65000+0x3cd000 cb7083551c2067e12817edf26bf3ff182510fc8e@0x7f7815a65280 - - /usr/lib64/libc-2.17.so
0x7f7815e32000+0x21c000 a0108a7886e83065b78f861dd306a6f9adf50b85@0x7f7815e32248 - - /usr/lib64/libpthread-2.17.so

So the highest address less than each frame address is from libc for both: 0x7f7815a65000

0x7f7815e2b760 - 0x7f7815a65000 = 0x3c6760
0x7f7815aeae42 - 0x7f7815a65000 = 0x85e42

but the build id for that libc seems unusual. 

glibc-debuginfo-2.17-222.el7.x86_64 has 

╎❯ ls -l /usr/lib/debug/.build-id/85/ea0ae559b53ab60d8548242cedd0e83f4816da.debug
lrwxrwxrwx. 1 root root 30 Jul 17 13:22 /usr/lib/debug/.build-id/85/ea0ae559b53ab60d8548242cedd0e83f4816da.debug -> ../../lib64/libc-2.17.so.debug

and despite adding various RHEL 7 repositories I can't find the one mentioned in the core file:

╎❯ sudo yum -y install /usr/lib/debug/.build-id/cb/7083551c2067e12817edf26bf3ff182510fc8e.debug
No package /usr/lib/debug/.build-id/cb/7083551c2067e12817edf26bf3ff182510fc8e.debug available.

Are they using a custom build of glibc?
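
The manual lookup above (find the module whose range contains an address, subtract the base, then derive the debuginfo path from the build id) can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names parse_module, find_module and debuginfo_path are hypothetical, the field layout is inferred from the eu-unstrip -n output pasted above, and the path scheme is the /usr/lib/debug/.build-id/XX/REST.debug form seen in the ls and yum commands in this comment.

```python
def parse_module(line):
    """Split one `eu-unstrip -n` line into (start, size, build_id, path).

    Assumed layout, inferred from the output above:
    START+SIZE BUILDID@ADDR FILE DEBUGFILE MODULENAME
    """
    fields = line.split()
    start_text, size_text = fields[0].split("+")
    build_id = fields[1].split("@")[0]
    return int(start_text, 16), int(size_text, 16), build_id, fields[-1]


def find_module(lines, addr):
    """Return (path, offset, build_id) of the module containing addr, or None."""
    for line in lines:
        start, size, build_id, path = parse_module(line)
        if start <= addr < start + size:
            return path, addr - start, build_id
    return None


def debuginfo_path(build_id):
    """Build-id based debuginfo path, as used by the commands in this comment."""
    return "/usr/lib/debug/.build-id/%s/%s.debug" % (build_id[:2], build_id[2:])


if __name__ == "__main__":
    modules = [
        "0x7f7815a65000+0x3cd000 cb7083551c2067e12817edf26bf3ff182510fc8e@0x7f7815a65280 "
        "- - /usr/lib64/libc-2.17.so",
        "0x7f7815e32000+0x21c000 a0108a7886e83065b78f861dd306a6f9adf50b85@0x7f7815e32248 "
        "- - /usr/lib64/libpthread-2.17.so",
    ]
    for frame in (0x7F7815E2B760, 0x7F7815AEAE42):
        path, offset, build_id = find_module(modules, frame)
        print(hex(frame), path, hex(offset), debuginfo_path(build_id))
```

Checking the range (start + size) rather than only picking the highest base below the address guards against addresses that fall into a gap between mappings.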

Anyway, given the surrounding frames were gibberish, I don't have a lot of confidence the decoded symbols would be meaningful regardless.

Still, it would be interesting to know if they get a better backtrace from running gdb locally than I do on my machine.

But going on what info we do have...just looking at the callers of meta_window_actor_is_destroyed in the code, I see one interesting chain:

meta_pre_paint_func -> meta_window_actor_should_unredirect -> meta_window_actor_is_destroyed

It's interesting because the code is as follows:

  top_window_actor = compositor->top_window_actor;
  if (top_window_actor &&
      meta_window_actor_should_unredirect (top_window_actor) &&
      compositor->disable_unredirect_count == 0)
    {

and there was a fix upstream for crashers from stale top_window_actor here:

https://gitlab.gnome.org/GNOME/mutter/commit/b1587f0

So if that's the crash we're hitting, then we may be hitting it now because the GJS packages are more aggressive with garbage collection.

I'll do a scratch build with that fix pulled in and update the people page.

Comment 47 Milan Crha 2019-09-16 08:43:56 UTC
Running gnome-shell-3.28.5-16 and mutter-3.28.5-15 under valgrind, the largest leak reported is this one:

==1243== 161,226 (80 direct, 161,146 indirect) bytes in 2 blocks are definitely lost in loss record 46,238 of 46,271
==1243==    at 0x100C29F73: malloc (vg_replace_malloc.c:309)
==1243==    by 0x1018C768D: g_malloc (gmem.c:99)
==1243==    by 0x1018DEC8D: g_slice_alloc (gslice.c:1025)
==1243==    by 0x1018FD56D: g_variant_alloc (gvariant-core.c:476)
==1243==    by 0x1018FD56D: g_variant_new_from_children (gvariant-core.c:565)
==1243==    by 0x1018FA188: g_variant_builder_end (gvariant.c:3703)
==1243==    by 0x10134C403: parse_value_from_blob (gdbusmessage.c:1823)
==1243==    by 0x10134E372: g_dbus_message_new_from_blob (gdbusmessage.c:2148)
==1243==    by 0x1013589CA: _g_dbus_worker_do_read_cb (gdbusprivate.c:744)
==1243==    by 0x1013140E3: g_task_return_now (gtask.c:1148)
==1243==    by 0x101314118: complete_in_idle_cb (gtask.c:1162)
==1243==    by 0x1018BEC76: g_idle_dispatch (gmain.c:5533)
==1243==    by 0x1018C2048: g_main_dispatch (gmain.c:3175)
==1243==    by 0x1018C2048: g_main_context_dispatch (gmain.c:3828)
==1243==    by 0x1018C23A7: g_main_context_iterate.isra.19 (gmain.c:3901)
==1243==    by 0x1018C2679: g_main_loop_run (gmain.c:4097)
==1243==    by 0x1013569A5: gdbus_shared_thread_func (gdbusprivate.c:275)
==1243==    by 0x1018E94EF: g_thread_proxy (gthread.c:784)
==1243==    by 0x1036D0EA4: start_thread (in /usr/lib64/libpthread-2.17.so)
==1243==    by 0x1039E38DC: clone (in /usr/lib64/libc-2.17.so)

There are some others from libmozjs, libgjs and clutter, but none as significant as this one.
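
To compare loss records by size without reading the whole valgrind log by hand, the summary lines can be extracted and sorted. This is a rough sketch only: the helper name parse_loss_records is hypothetical, and the regular expression is an assumption modelled on the summary line quoted above (with the "direct/indirect" part treated as optional); it is not a complete parser for valgrind output.

```python
import re

# Modelled on lines such as:
#   ==1243== 161,226 (80 direct, 161,146 indirect) bytes in 2 blocks are
#   definitely lost in loss record 46,238 of 46,271
# Assumption: the parenthesised direct/indirect part may be absent.
LOSS_RE = re.compile(
    r"==\d+== (?P<total>[\d,]+) "
    r"(?:\((?P<direct>[\d,]+) direct, (?P<indirect>[\d,]+) indirect\) )?"
    r"bytes in (?P<blocks>[\d,]+) blocks are (?P<kind>\w+) lost "
    r"in loss record (?P<record>[\d,]+) of (?P<count>[\d,]+)"
)


def _to_int(text):
    """Convert a number with thousands separators, e.g. '161,226', to int."""
    return int(text.replace(",", ""))


def parse_loss_records(log_text):
    """Return the loss records found in log_text, largest total first."""
    records = []
    for match in LOSS_RE.finditer(log_text):
        total = _to_int(match.group("total"))
        direct = match.group("direct")
        indirect = match.group("indirect")
        records.append({
            "total": total,
            "direct": _to_int(direct) if direct else total,
            "indirect": _to_int(indirect) if indirect else 0,
            "blocks": _to_int(match.group("blocks")),
            "kind": match.group("kind"),
            "record": _to_int(match.group("record")),
        })
    return sorted(records, key=lambda rec: rec["total"], reverse=True)


if __name__ == "__main__":
    sample = (
        "==1243== 161,226 (80 direct, 161,146 indirect) bytes in 2 blocks are "
        "definitely lost in loss record 46,238 of 46,271\n"
    )
    for rec in parse_loss_records(sample):
        print(rec)
```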

Comment 48 Milan Crha 2019-09-16 13:33:11 UTC
(In reply to Milan Crha from comment #47)
> ==1243== 161,226 (80 direct, 161,146 indirect) bytes in 2 blocks are
> definitely lost in loss record 46,238 of 46,271

Maybe these are not real leaks, or at least they do not pile up over time. I added some debugging prints, and most of the leaked messages are signals from the ca.desrt.dconf.Writer interface. I see around 137 of those on gnome-shell exit, but that may be just a coincidence: a minute earlier there was no such leaked GDBusMessage, so it may be that the logout initiates a DConf write which has not finished.

Comment 49 Milan Crha 2019-09-17 16:08:55 UTC
The change for atk from bug #1457206 is wrong, at least for gnome-shell (or possibly for anything but Evolution): it causes leaks of AtkObjects, even though it fixes the issue for Evolution. That means the problem is in Evolution, not in atk. My fault. I do not know whether there are other memory leaks in gnome-shell/mutter.

Comment 51 Milan Crha 2019-09-19 12:48:17 UTC
Created attachment 1616709 [details]
debugging patch for glib2

This is a debugging glib2 patch which tracks which of glib's GObjects have been created and which are still alive in gnome-shell. It dumps this information to /var/tmp/gobjs-PID-YYYYMMDD-HHmmSS.log files once per minute, together with memory statistics for that process. It also creates one "-atexit" file when gnome-shell is closing. It's similar to glib's GObject instance tracking, though not exactly the same.

Maybe it'll help someone more knowledgeable about the gnome-shell internals to identify the culprit. Trying with de-patched atk (bug #1753123), I see some object counts peak when holding Super+A for a minute or more, but these are freed after a minute or so, probably by a garbage collector or something like that. The Rss memory doesn't go down after this, but it stayed at ~750MB once it reached this value. I know the kernel is bad at releasing a lot of small freed blocks of memory, so maybe that is the reason why the Rss doesn't go down. It seems to be reused at least.

Comment 52 Milan Crha 2019-09-19 12:50:02 UTC
Bug #1693156 comment #22 contains a test program which reproduces the kernel behavior mentioned at the end of the previous comment.

Comment 60 Nathan Wallwork 2020-05-04 19:38:47 UTC
This appears to be similar to the RHEL8 bug (https://bugzilla.redhat.com/show_bug.cgi?id=1719819) "Gnome garbage collection leak [rhel-8]".
That was fixed with updates to gnome-shell and mutter packages.  (https://access.redhat.com/errata/RHSA-2020:1766)
Is a similar solution in the works, for RHEL 7?

Comment 61 Ray Strode [halfline] 2020-05-04 19:43:17 UTC
Investigation of this issue for Red Hat Enterprise Linux 7 is ongoing. Some of the performance work performed in Red Hat Enterprise Linux 8 is applicable.

Comment 105 Michael Boisvert 2020-10-23 17:42:23 UTC
I have been testing and monitoring the test packages that Ray made and I am happy to report positive results. Monitoring system memory shows gnome-shell memory usage does not increase over time and some "non-normal" usage doesn't trigger anything unsavory either.

Comment 114 errata-xmlrpc 2020-11-10 13:17:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (GNOME bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5048

