Bug 1880833

Summary: Massive memory leak on AMD cards
Product: [Fedora] Fedora Reporter: Daniel Mach <dmach>
Component: kernel Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: unspecified    
Version: 33 CC: acaringi, agurenko, airlied, ajax, awilliam, billyfgarcia, bskeggs, bugzilla, caillon+fedoraproject, dmach, ego.cordatus, gmarr, hdegoede, ichavero, igor.raits, itamar, jakob, jarodwilson, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, kleinkravis44, kparal, lgoncalv, linville, lyude, masami256, mchehab, mjg59, rclark, rhughes, rkudyba, robatino, rstrode, steved, tstellar
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: RejectedBlocker
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-12-01 07:25:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
kmemleak output none

Description Daniel Mach 2020-09-20 12:02:19 UTC
Description of problem:
My Fedora based home NAS/HTPC runs out of memory over night.
There's only idling kodi.


Version-Release number of selected component (if applicable):
mesa-dri-drivers-20.2.0~rc4-1.fc33.x86_64
xorg-x11-server-Xorg-1.20.8-3.fc33.x86_64


How reproducible:
always, but takes time


Steps to Reproduce:
1. boot a computer with an AMD graphics card
2. run xorg + kodi (another app or a regular desktop session might work too)
3. keep it idling for hours


Actual results:
All memory and swap are consumed, apparently by the kernel, because the usage doesn't show up in processes or disk cache.
The only option is to reboot.


Expected results:
No memory leaks.


Additional info:
I'm running Athlon 200ge, integrated Radeon Vega 3 GPU, using amdgpu driver
I've been experiencing this problem since August. I upgraded from F32 to F33 on 2020-08-26 to check whether newer versions of the packages fixed the problem, but it did not help.

These might be related:
https://www.reddit.com/r/linux/comments/irnrqv/warning_about_a_brandnew_memory_leak_in_mesa_at/
https://gitlab.freedesktop.org/mesa/mesa/-/issues/3513
https://www.reddit.com/r/archlinux/comments/ippg4i/linux_literally_eats_all_my_ram_and_swap_anyone/

Comment 1 Fedora Blocker Bugs Application 2020-09-20 12:04:05 UTC
Proposed as a Blocker for 33-final by Fedora user dmach using the blocker tracking app because:

 It makes Fedora unusable on systems with an AMD graphics card.

Comment 2 Kamil Páral 2020-09-21 11:34:59 UTC
The problem mentioned in https://gitlab.freedesktop.org/mesa/mesa/-/issues/3513 is very serious, but it doesn't seem to be Daniel's problem. According to https://gitlab.freedesktop.org/mesa/mesa/-/issues/3513#note_628193 it was caused by this commit https://gitlab.freedesktop.org/mesa/mesa/-/commit/3d5bed0e883217242a4357116399f60486580170 . That commit is not present in mesa-20.2.0-rc4 currently present in F33. It surely also wasn't present in F32, and Daniel says he experienced it even in F32 since August. So the issue must be different.

Daniel, have you been using Fedora-provided mesa drivers on F32? I have Radeon 580 running on F32 and I experience no issues. So the problem that affects your card is likely not universal to all AMD graphics cards, just certain ones (or there is some other difference in play).

Comment 3 Daniel Mach 2020-09-21 14:08:14 UTC
Yes, I have Fedora-provided packages; nothing custom or 3rd party except a couple of additional packages from rpmfusion.
The driver can use different code paths for our GPUs; I believe that mine is a newer generation.
I'll probably try the mesa build from F32 GA recompiled for F33 to see if the problem can be reproduced with a significantly older build (I hope it's going to build and install).

Comment 4 Chris Murphy 2020-09-21 18:19:07 UTC
If it existed with F32 then it may be a kernel regression. When was the last time this didn't happen? I suggest tracking down the x.y.0 kernels in koji and trying them out. That's easier to test than a bunch of mesa versions - I think. i.e. 5.4.0, 5.5.0, 5.6.0, 5.7.0 - perhaps in reverse order.

Comment 5 Klein Kravis 2020-09-21 18:34:54 UTC
I think it is an awful idea to release with this bug.

Comment 6 Adam Williamson 2020-09-21 19:27:48 UTC
Discussed at 2020-09-21 blocker review meeting: https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2020-09-21/f33-blocker-review.2020-09-21-16.00.html . We agreed to delay decision on the blocker status of this bug because at present it's unclear how wide a range of hardware may be affected.

Comment 7 Daniel Mach 2020-09-22 14:28:21 UTC
I tried kernel-5.6.6-300.fc32.x86_64 (Fedora 32 GA kernel) and the problem is also there.
I'll try an older kernel version and report back.

Comment 8 Michel Dänzer 2020-09-24 08:08:11 UTC
(In reply to Daniel Mach from comment #0)
> 2. run xorg + kodi (another app or a regular desktop session might work too)

Have you seen it with anything other than kodi?

Did kodi itself get updated around the time when this started happening?

Assuming the memory is reclaimed when the kodi process dies, can you try

 valgrind --leak-check=full /usr/lib64/kodi/kodi-x11

and attach the output from that (after letting kodi run for some time and then quitting it)?

Comment 9 Daniel Mach 2020-09-25 06:28:06 UTC
I also tried downgrading kodi, but it did not help.
Maybe I need to try an even older version.

Memory is *not* reclaimed after kodi process dies.
As I stated in comment#0, the memory seems to be consumed by the kernel.
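
The reasoning from comment#0 (memory is gone, yet neither processes nor the disk cache account for it) can be checked roughly against /proc/meminfo. Below is a minimal Python sketch, assuming the usual meminfo field names. The helper names and the choice of counters to subtract are illustrative, not an official accounting method, and the result is only a rough estimate. Pages that a driver takes straight from the page allocator are, as far as I know, not covered by these counters, so they would show up in the remainder.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict: field name -> number.

    Most fields are in kiB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Counters subtracted from MemTotal (an assumption of this sketch).
ACCOUNTED_FIELDS = (
    "MemFree", "Buffers", "Cached", "AnonPages",
    "Slab", "PageTables", "KernelStack",
)


def unaccounted_kib(fields):
    """Rough estimate (kiB) of memory not explained by the common counters."""
    accounted = sum(fields.get(key, 0) for key in ACCOUNTED_FIELDS)
    return fields["MemTotal"] - accounted


def read_unaccounted_mib(path="/proc/meminfo"):
    """Convenience wrapper: read the live file and return MiB."""
    with open(path) as f:
        return unaccounted_kib(parse_meminfo(f.read())) / 1024
```

Logging `read_unaccounted_mib()` every few minutes while the machine idles should show a steadily growing number if memory is leaking outside of processes, cache and slab.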

Comment 10 Michel Dänzer 2020-09-25 08:37:30 UTC
(In reply to Daniel Mach from comment #9)
> Memory is *not* reclaimed after kodi process dies.

Is it reclaimed when the Xorg process dies?

If not, this sounds like a kernel bug (though it might only be triggered by newer userspace).

Comment 11 Chris Murphy 2020-09-28 16:39:37 UTC
I can't tell from the info reported if this is a mesa or kernel problem; and if it's a kernel problem whether it's a regression. More info is needed to progress it further.

Comment 12 Geoffrey Marr 2020-09-28 17:03:16 UTC
Discussed during the 2020-09-28 blocker review meeting: [0]

The decision to delay the classification of this as a blocker bug was made as it's still unclear how wide a range of hardware may be affected and where the bug lies.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2020-09-28/f33-blocker-review.2020-09-28-16.01.txt

Comment 13 Geoffrey Marr 2020-10-05 20:00:11 UTC
Discussed during the 2020-10-05 blocker review meeting: [0]

The decision to classify this bug as a "RejectedBlocker (Final)" was made as the current information suggests this is a corner case that does not have broad enough impact to block on. It can be re-proposed if further information suggests that isn't true.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2020-10-05/f33-blocker-review.2020-10-05-16.00.txt

Comment 14 Daniel Mach 2020-10-14 16:05:10 UTC
I did not manage to bisect the problem yet.
The testing is slow, because the system has to run for a while before the memleak shows.

I originally wanted to compile old versions of the packages and try them on my system,
but they were frequently failing to build from source with new software in the build root.

Then I realized that installing an old python opens up the possibility of downgrading to packages from older Fedoras:
$ dnf --enablerepo=rawhide install python3.8 --nogpgcheck

Now I'm trying packages after running:
$ dnf --releasever=31 --repoid=fedora --repoid=rpmfusion-free downgrade 'kodi*' 'mesa*' 'xorg*' --nogpgcheck

I already tried --releasever=32, but it did not help.

Comment 15 Daniel Mach 2020-10-24 07:54:14 UTC
Downgrading packages even further did not help.
Since the leaked memory is in the kernel, reassigning to that component.

Comment 16 Daniel Mach 2020-10-24 07:57:57 UTC
Created attachment 1723908 [details]
kmemleak output

I installed and booted kernel-debug and ran the following to get the file:
$ echo scan=on > /sys/kernel/debug/kmemleak
# waited for about 20 minutes
$ cat /sys/kernel/debug/kmemleak  > kmemleak

I'm not skilled in debugging the kernel; if someone could guide me, I'll do my best to provide more detailed information.
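
A kmemleak dump like the attached one can get long, so a per-task summary may help when reading it. Below is a minimal Python sketch that totals the reported object sizes per task name. It assumes the usual report layout, where each entry starts with an `unreferenced object 0x... (size N):` line followed by a `comm "...", pid ...` line. The function name is made up for this example, and kmemleak entries are only suspected leaks.

```python
import re
from collections import Counter

# Assumed kmemleak report layout: a header line, then an indented comm line.
OBJECT_RE = re.compile(r"^unreferenced object 0x[0-9a-f]+ \(size (\d+)\):")
COMM_RE = re.compile(r'^\s+comm "([^"]*)", pid (\d+)')


def summarize_kmemleak(text):
    """Return a Counter: task name (comm) -> total suspected leaked bytes."""
    totals = Counter()
    pending_size = None
    for line in text.splitlines():
        m = OBJECT_RE.match(line)
        if m:
            # Remember the size until the matching comm line shows up.
            pending_size = int(m.group(1))
            continue
        m = COMM_RE.match(line)
        if m and pending_size is not None:
            totals[m.group(1)] += pending_size
            pending_size = None
    return totals
```

Calling `summarize_kmemleak(open("kmemleak").read()).most_common(10)` on the file created above would list the tasks with the largest suspected totals first.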

Comment 17 Fabian 2020-10-29 11:23:25 UTC
I had Fedora 32 Workstation running on the specs below with no issues. I downloaded the Fedora 33 Workstation ISO and booted from it, and performance was very slow and sluggish. I installed to the drive from the same live DVD, and after booting from the internal SSD the issue was the same: very slow, sluggish performance.

ASUS PRIME TRX40-PRO
 
CPU: AMD Threadripper™ 3960X
 
RAM: 32GB RA
 
GPU: GeForce GTX 1650 4GB

Comment 18 Daniel Mach 2020-10-30 13:11:27 UTC
I've noticed some quite odd behavior which may explain why it's not easily reproducible by someone else:
My NAS is connected to an AV receiver, which is connected to a TV.
The memleaks seem to occur when the receiver and TV are in standby mode.
When they're on, kmemleak stops reporting to `dmesg`.

Let me see what happens when I unplug the HDMI cable.

Comment 19 Daniel Mach 2020-10-31 18:04:24 UTC
So it seems that the problem occurs when there's no active display.
Unplugging the HDMI cable also causes memleaks.

Comment 20 Chris Murphy 2020-11-01 21:30:28 UTC
If you're already certain where the memleak is happening, ignore this. bcc-tools includes a memleak tool that's described in part: memleak traces and matches memory allocation and deallocation requests, and collects call stacks for each allocation. memleak can then print a summary of which call stacks performed allocations that weren't subsequently freed.

Comment 21 Daniel Mach 2020-11-14 17:55:44 UTC
This is what I got from the memleak tool (among other reports):
        247463936 bytes in 118 allocations from stack
                __alloc_pages_nodemask+0x2bf [kernel]
                __alloc_pages_nodemask+0x2bf [kernel]
                ttm_alloc_new_pages.isra.0+0x9b [ttm]
                ttm_pool_populate.part.0+0x180 [ttm]
                ttm_populate_and_map_pages+0x1c5 [ttm]
                ttm_tt_populate.part.0+0x1e [ttm]
                ttm_tt_bind+0x48 [ttm]
                ttm_bo_handle_move_mem+0x5a9 [ttm]
                ttm_bo_validate+0x17c [ttm]
                ttm_bo_init_reserved+0x313 [ttm]
                amdgpu_bo_do_create+0x1a3 [amdgpu]
                amdgpu_bo_create+0x30 [amdgpu]
                amdgpu_gem_object_create+0x7b [amdgpu]
                amdgpu_gem_create_ioctl+0x93 [amdgpu]
                drm_ioctl_kernel+0x8c [drm]
                drm_ioctl+0x206 [drm]
                amdgpu_drm_ioctl+0x49 [amdgpu]
                ksys_ioctl+0x82 [kernel]
                __x64_sys_ioctl+0x16 [kernel]
                do_syscall_64+0x52 [kernel]
                entry_SYSCALL_64_after_hwframe+0x44 [kernel]

$ uname -r
5.8.18-300.fc33.x86_64+debug
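
Since the memleak tool prints many such reports, it may help to total them per driver. Below is a minimal Python sketch that parses the `N bytes in M allocations from stack` format shown above and attributes each stack to the first module that is not a generic layer. The function names and the `skip` list are assumptions made for this example, not part of bcc-tools.

```python
import re

HEADER_RE = re.compile(r"^\s*(\d+) bytes in (\d+) allocations from stack")
FRAME_RE = re.compile(r"^\s+(\S+)\+0x[0-9a-f]+ \[(\w+)\]")


def parse_memleak(text):
    """Return a list of dicts with bytes, allocations and (symbol, module) frames."""
    records = []
    current = None
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = {
                "bytes": int(header.group(1)),
                "allocations": int(header.group(2)),
                "frames": [],
            }
            records.append(current)
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            current["frames"].append((frame.group(1), frame.group(2)))
    return records


def bytes_by_driver(records, skip=("kernel", "ttm", "drm")):
    """Sum outstanding bytes per module, ignoring generic layers in `skip`."""
    totals = {}
    for rec in records:
        owner = next(
            (mod for _, mod in rec["frames"] if mod not in skip), "unknown"
        )
        totals[owner] = totals.get(owner, 0) + rec["bytes"]
    return totals
```

For the report above this attributes all 247463936 bytes to amdgpu. Dividing by the 118 allocations gives exactly 2097152 bytes (2 MiB) per allocation.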

Comment 22 Daniel Mach 2020-12-01 07:25:27 UTC
Everything is working fine now, no leaks for about 2 weeks.

Packages:
kernel-5.9.10-200.fc33.x86_64
xorg-x11-server-Xorg-1.20.9-1.fc33.x86_64
xorg-x11-drv-amdgpu-19.1.0-5.fc33.x86_64
mesa-dri-drivers-20.2.3-1.fc33.x86_64

Still running the same kodi build.

I'm closing the bug because it's not quite reproducible and I'm happy that the problem is gone.

Comment 23 RobbieTheK 2021-03-08 18:43:49 UTC
I've seen this issue on 3 Dell PowerEdge 740's. The kernel was around 5.9.15-200. I also see this in the journal logs:

DMAR: [Firmware Bug]: No firmware reserved region can cover this RMRR [0x000000006f8a0000-0x000000006f8a2fff], contact BIOS vendor for >
DMAR: [Firmware Bug]: Your BIOS is broken; bad RMRR [0x000000006f8a0000-0x000000006f8a2fff] BIOS vendor: Dell Inc.; Ver: 2.8.2; Product Version:
DMAR: ATSR flags: 0x0

It's not an AMD card:
lspci | grep -i --color 'vga\|3d\|2d'
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)

So could this be a different issue?

Comment 24 Adam Williamson 2021-09-24 23:33:22 UTC
RobbieTheK: yes, since you have a Matrox adapter not AMD, you're definitely not seeing the same problem Daniel was seeing.