Bug 2091158
| Summary: | cockpit-bridge --privileged has a memory leak when open for extended periods of time | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Tom Crider <tcrider> |
| Component: | cockpit | Assignee: | Martin Pitt <mpitt> |
| Status: | NEW --- | QA Contact: | Jan Ščotka <jscotka> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 8.6 | CC: | allison.karlitskaya, dareynol, jstormshak, mpitt, sbarcomb, tombouwman |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
I am also having this problem. I am running Fedora 37; I recently (2022-12-14) upgraded from Fedora 36, and I am not sure whether it was already a problem in F36.

Upgrade dates and versions:

- 2022-12-01: cockpit-bridge-279-1.fc36.x86_64 --> cockpit-bridge-280-1.fc36.x86_64
- 2022-12-14: cockpit-bridge-280-1.fc36.x86_64 --> cockpit-bridge-281-1.fc37.x86_64 (through DNF System Upgrade)
- 2022-12-19: cockpit-bridge-281-1.fc37.x86_64 --> cockpit-bridge-282-1.fc37.x86_64

What kind of info would you like to have?

I cannot reproduce the problem after booting to kernel 6.0.15-300.fc37.x86_64 on January 1st 2023.

The problem is back. Yesterday (Jan 22nd 2023) I updated cockpit-bridge-282-1.fc37.x86_64 --> cockpit-bridge-283-1.fc37.x86_64 and rebooted to kernel 6.1.6-200.fc37.x86_64. What kind of info would you like to have?

It is relatively easy to reproduce:

1. Log in to cockpit.
2. This will bring you to the Overview page.
3. Click on "View metrics and history".
4. Start top and sort on %MEM.
5. Wait some time.
6. All available memory will go to cockpit-bridge.

top - 17:01:41 up 13:01, 3 users, load average: 0.27, 0.31, 0.25
Tasks: 466 total, 2 running, 464 sleeping, 0 stopped, 0 zombie
%Cpu0 : 3.5 us, 0.6 sy, 0.0 ni, 95.1 id, 0.8 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu1 : 5.1 us, 1.5 sy, 0.0 ni, 92.5 id, 1.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 2.2 us, 1.4 sy, 0.0 ni, 95.9 id, 0.5 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu3 : 3.1 us, 2.1 sy, 0.0 ni, 94.0 id, 0.8 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8100940 total, 137224 free, 7819316 used, 144400 buff/cache
KiB Swap: 24878072 total, 24202668 free, 675404 used. 75276 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
46129 root 20 0 7156720 6.5g 6536 R 10.8 84.6 6:48.96 cockpit-bridge
46158 root 20 0 388220 221076 6220 S 0.0 2.7 1:29.46 /usr/libexec/cockpit-pcp
20439 root 20 0 995240 63256 9712 S 0.0 0.8 0:31.87 /usr/libexec/packagekitd
1513 dbus 20 0 42736 38296 916 S 1.7 0.5 1:47.48 dbus-broker --log 4 --controller 9 --machine-id 37a41eb83b7f43ce8e61400ddda+
46104 cockpit+ 20 0 193980 35328 7092 S 0.0 0.4 0:54.56 /usr/libexec/cockpit-ws --for-tls-proxy --port=0
1 root 20 0 173380 12180 4844 S 2.4 0.2 3:07.26 /usr/lib/systemd/systemd rhgb --switched-root --system --deserialize 31
1587 root 20 0 1637080 12140 7540 S 0.0 0.1 0:05.10 /usr/sbin/libvirtd --timeout 120
46117 root 20 0 20192 12112 9160 S 0.2 0.1 0:01.62 /usr/lib/systemd/systemd --user
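To quantify the growth described in the reproduction steps above, one option is to sample the bridge's resident set size from /proc at a fixed interval instead of eyeballing top. This is a minimal sketch, not part of cockpit: it assumes the cockpit-bridge PID is passed as the only argument, and the 60-second interval is arbitrary.

```c
/* Sketch: sample a process's resident set size from /proc/<pid>/statm once a
 * minute and print it, to confirm steady growth over time.  The PID comes
 * from argv[1]; the 60-second interval is an arbitrary illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>

int
main (int argc, char **argv)
{
  if (argc != 2)
    {
      fprintf (stderr, "usage: %s <pid>\n", argv[0]);
      return 1;
    }

  char path[64];
  snprintf (path, sizeof path, "/proc/%s/statm", argv[1]);

  long page_kib = sysconf (_SC_PAGESIZE) / 1024;

  for (;;)
    {
      FILE *f = fopen (path, "r");
      if (f == NULL)
        {
          perror (path);   /* process exited or PID is wrong */
          return 1;
        }

      long size, resident;   /* statm fields are in pages */
      if (fscanf (f, "%ld %ld", &size, &resident) == 2)
        printf ("%ld rss=%ld KiB\n", (long) time (NULL), resident * page_kib);
      fclose (f);

      sleep (60);
    }
}
```

Redirecting the output to a file for a few hours should show whether RSS climbs monotonically while the Metrics page is open.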
Not able to reproduce on cockpit-bridge-294.1-1.fc37.x86_64 anymore. This may be closed as could not reproduce.

Thanks Tom Bouwman. Unfortunately I was never able to reproduce it either, but let's leave the last word to Tom Crider as the original reporter. Going forward, we will soon switch Fedora to our new Python bridge reimplementation, which hopefully fixes this as well :-) But we won't introduce that into RHEL 8, it's too big of a change.

(In reply to Tom Bouwman from comment #14)
> Not able to reproduce on cockpit-bridge-294.1-1.fc37.x86_64 anymore
>
> This may be closed as could not reproduce.

Hi Tom,

Unfortunately this bug was originally opened for RHEL 8, not Fedora. We cannot currently validate it as resolved on RHEL, and generally a memory leak would be something that would need to be fixed. While I am glad to hear you're no longer experiencing the issue on Fedora, I don't feel closing the bug would be a good call here without validating whether or not it's still a problem in RHEL.

For future reference, a separate bug should be opened for Fedora, with a reference to the RHEL bug, as they are different distributions and versions and need to be tracked separately.

As for the RHEL side, I am working to check whether the issue persists, as there have been 20+ version revisions in the RHEL cockpit-bridge package since the issue was reported over a year ago.

Martin, thank you again for your ongoing work on this, it has been greatly appreciated!

Tom Crider, RHCE
Software Maintenance Engineer
Global Support Services - North America
Red Hat, Inc.
1.888.GO.REDHAT
Thank you Tom for your detailed report!

Allison, I have some trouble making sense of any of the leaks. The initial ones ("still reachable"), like these:

==00:00:32:23.619 124531== 1 bytes in 1 blocks are still reachable in loss record 1 of 2,364
==00:00:32:23.619 124531==    at 0x4C37135: malloc (vg_replace_malloc.c:381)
==00:00:32:23.620 124531==    by 0x5497475: g_malloc (gmem.c:99)
==00:00:32:23.620 124531==    by 0x54B0DB2: g_strdup (gstrfuncs.c:363)
==00:00:32:23.620 124531==    by 0x5211634: g_param_spec_string (gparamspecs.c:2266)
==00:00:32:23.620 124531==    by 0x15433F: cockpit_web_server_class_init (cockpitwebserver.c:357)
==00:00:32:23.620 124531==    by 0x15433F: cockpit_web_server_class_intern_init (cockpitwebserver.c:85)
==00:00:32:23.620 124531==    by 0x52229DA: type_class_init_Wm (gtype.c:2233)
==00:00:32:23.620 124531==    by 0x52229DA: g_type_class_ref (gtype.c:2948)
==00:00:32:23.620 124531==    by 0x52091C1: g_object_new_valist (gobject.c:2074)
==00:00:32:23.620 124531==    by 0x52092AC: g_object_new (gobject.c:1642)
==00:00:32:23.620 124531==    by 0x130184: cockpit_packages_new (cockpitpackages.c:1233)
==00:00:32:23.620 124531==    by 0x11F4EE: setup_router (bridge.c:296)
==00:00:32:23.620 124531==    by 0x11EC64: run_bridge (bridge.c:396)
==00:00:32:23.620 124531==    by 0x11EC64: main (bridge.c:643)

... seem harmless: they get allocated at startup, are singletons, and are just *never* freed, as they live as long as the bridge is running.

So I took a closer look at the first "definitive" one:

==00:00:32:23.812 124531== 1,161,332 (606,384 direct, 554,948 indirect) bytes in 12,633 blocks are definitely lost in loss record 2,363 of 2,364
==00:00:32:23.812 124531==    at 0x4C37135: malloc (vg_replace_malloc.c:381)
==00:00:32:23.812 124531==    by 0x5497475: g_malloc (gmem.c:99)
==00:00:32:23.812 124531==    by 0x54AF086: g_slice_alloc (gslice.c:1026)
==00:00:32:23.812 124531==    by 0x54CEF55: g_variant_get_child_value (gvariant-core.c:1041)
==00:00:32:23.812 124531==    by 0x12175F: build_json_dictionary (cockpitdbusjson.c:705)
==00:00:32:23.812 124531==    by 0x12175F: build_json (cockpitdbusjson.c:863)
==00:00:32:23.812 124531==    by 0x121CCA: build_json_array_or_tuple (cockpitdbusjson.c:676)
==00:00:32:23.812 124531==    by 0x12165A: build_json (cockpitdbusjson.c:881)
==00:00:32:23.812 124531==    by 0x123380: build_json_body (cockpitdbusjson.c:960)
==00:00:32:23.812 124531==    by 0x123380: build_json_signal (cockpitdbusjson.c:985)
==00:00:32:23.812 124531==    by 0x123380: on_signal_message (cockpitdbusjson.c:2615)
==00:00:32:23.812 124531==    by 0x4F098B7: emit_signal_instance_in_idle_cb (gdbusconnection.c:3721)
==00:00:32:23.812 124531==    by 0x548E27A: g_idle_dispatch (gmain.c:5579)
==00:00:32:23.812 124531==    by 0x549195C: g_main_dispatch (gmain.c:3193)
==00:00:32:23.812 124531==    by 0x549195C: g_main_context_dispatch (gmain.c:3873)
==00:00:32:23.812 124531==    by 0x5491D17: g_main_context_iterate.isra.21 (gmain.c:3946)
==00:00:32:23.812 124531==    by 0x5491DAF: g_main_context_iteration (gmain.c:4007)
==00:00:32:23.812 124531==    by 0x11ED8B: run_bridge (bridge.c:426)
==00:00:32:23.812 124531==    by 0x11ED8B: main (bridge.c:643)

on_signal_message() creates the object and unrefs it at the end, so that balances. send_with_barrier() also refs it, but on_wait_complete() unrefs it again. Without a batch, cockpit_dbus_cache_barrier() immediately calls the callback (on_wait_complete), so that looks safe. If there is a batch, it gets queued up and barrier_progress() eventually calls the callback, but in either case cockpitdbuscache.c never even introspects its userdata. Does that tell you anything?
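For context on why the trace above points at a missing g_variant_unref(): g_variant_get_child_value() returns a new, full reference, so every iteration that takes a child value has to drop it again on every exit path. The sketch below is not the cockpit-bridge code; iterate_dict() is a hypothetical stand-in for build_json_dictionary() and only illustrates the kind of slip that would produce a "definitely lost" record like loss record 2,363.

```c
/* Illustration only: the GVariant reference-counting pattern implicated by
 * the valgrind record above.  iterate_dict() is a hypothetical stand-in for
 * build_json_dictionary(), not the real cockpit code. */
#include <glib.h>

static void
iterate_dict (GVariant *dict)
{
  gsize n = g_variant_n_children (dict);

  for (gsize i = 0; i < n; i++)
    {
      /* g_variant_get_child_value() returns a new (full) reference ... */
      GVariant *entry = g_variant_get_child_value (dict, i);

      /* ... so every path out of the loop body must release it.  An early
       * `continue` taken before this point would leak `entry` and show up
       * in valgrind exactly like the record quoted above. */
      g_variant_unref (entry);
    }
}

int
main (void)
{
  /* Build a toy a{sv} dictionary just so iterate_dict() has something to walk. */
  GVariantBuilder builder;
  g_variant_builder_init (&builder, G_VARIANT_TYPE ("a{sv}"));
  g_variant_builder_add (&builder, "{sv}", "key", g_variant_new_int32 (42));
  GVariant *dict = g_variant_ref_sink (g_variant_builder_end (&builder));

  iterate_dict (dict);
  g_variant_unref (dict);
  return 0;
}
```

If the real code unrefs on the normal path but skips it on some early-exit branch, the leak would scale with D-Bus signal traffic, which would be consistent with the steady growth reported while the Metrics page is open.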