Bug 1880833
Summary: | Massive memory leak on AMD cards | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | Daniel Mach <dmach>
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | high | Priority: | unspecified
Version: | 33 | Whiteboard: | RejectedBlocker
Hardware: | Unspecified | OS: | Unspecified
Last Closed: | 2020-12-01 07:25:27 UTC | Type: | Bug
CC: | acaringi, agurenko, airlied, ajax, awilliam, billyfgarcia, bskeggs, bugzilla, caillon+fedoraproject, dmach, ego.cordatus, gmarr, hdegoede, ichavero, igor.raits, itamar, jakob, jarodwilson, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, kleinkravis44, kparal, lgoncalv, linville, lyude, masami256, mchehab, mjg59, rclark, rhughes, rkudyba, robatino, rstrode, steved, tstellar | |
Description
Daniel Mach
2020-09-20 12:02:19 UTC
Proposed as a Blocker for 33-final by Fedora user dmach using the blocker tracking app because:

It makes Fedora unusable on systems with an AMD graphics card.

The problem mentioned in https://gitlab.freedesktop.org/mesa/mesa/-/issues/3513 is very serious, but it doesn't seem to be Daniel's problem. According to https://gitlab.freedesktop.org/mesa/mesa/-/issues/3513#note_628193 it was caused by this commit https://gitlab.freedesktop.org/mesa/mesa/-/commit/3d5bed0e883217242a4357116399f60486580170 . That commit is not present in mesa-20.2.0-rc4, the version currently in F33. It surely also wasn't present in F32, and Daniel says he has experienced the leak on F32 since August. So the issue must be different.

Daniel, have you been using Fedora-provided mesa drivers on F32?

I have a Radeon 580 running on F32 and I experience no issues. So the problem that affects your card is likely not universal to all AMD graphics cards, just certain ones (or there is some other difference in play).

Yes, I have Fedora-provided packages; nothing custom or 3rd party except a couple of additional packages from rpmfusion. The driver can use different code paths for our GPUs; I believe that mine is a newer generation. I'll probably try the mesa build from F32 GA recompiled for F33 to see if the problem can be reproduced with a significantly older build (I hope it's going to build and install).

If it existed with F32 then it may be a kernel regression. When was the last time this didn't happen? I suggest tracking down the x.y.0 kernels in koji and trying them out, i.e. 5.4.0, 5.5.0, 5.6.0, 5.7.0, perhaps in reverse order. That's easier to test than a bunch of mesa versions, I think.

I think it is an awful idea to release with this bug.

Discussed at 2020-09-21 blocker review meeting: https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2020-09-21/f33-blocker-review.2020-09-21-16.00.html .
We agreed to delay the decision on the blocker status of this bug because at present it's unclear how wide a range of hardware may be affected.

I tried kernel-5.6.6-300.fc32.x86_64 (Fedora 32 GA kernel) and the problem is also there. I'll try an older kernel version and report back.

(In reply to Daniel Mach from comment #0)
> 2. run xorg + kodi (another app or a regular desktop session might work too)

Have you seen it with anything other than kodi? Did kodi itself get updated around the time when this started happening?

Assuming the memory is reclaimed when the kodi process dies, can you try

  valgrind --leak-check=full /usr/lib64/kodi/kodi-x11

and attach the output from that (after letting kodi run for some time and then quitting it)?

I also tried downgrading kodi, but it did not help. Maybe I need to try an even older version. Memory is *not* reclaimed after the kodi process dies. As I stated in comment#0, the memory seems to be consumed by the kernel.

(In reply to Daniel Mach from comment #9)
> Memory is *not* reclaimed after kodi process dies.

Is it reclaimed when the Xorg process dies? If not, this sounds like a kernel bug (though it might only be triggered by newer userspace).

I can't tell from the info reported if this is a mesa or kernel problem; and if it's a kernel problem whether it's a regression. More info is needed to progress it further.

Discussed during the 2020-09-28 blocker review meeting: [0]

The decision to delay the classification of this as a blocker bug was made as it's still unclear how wide a range of hardware may be affected and where the bug lies.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2020-09-28/f33-blocker-review.2020-09-28-16.01.txt

Discussed during the 2020-10-05 blocker review meeting: [0]

The decision to classify this bug as a "RejectedBlocker (Final)" was made as the current information suggests this is a corner case that does not have broad enough impact to block on.
It can be re-proposed if further information suggests that isn't true.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2020-10-05/f33-blocker-review.2020-10-05-16.00.txt

I have not managed to bisect the problem yet. The testing is slow, because the system has to run for a while before the memleak shows up. I originally wanted to compile old versions of the packages and try them on my system, but they frequently failed to build from source with the new software in the build root. Then I realized that installing an old python opens up the possibility of downgrading to packages from older Fedoras:
$ dnf --enablerepo=rawhide install python3.8 --nogpgcheck
Now I'm trying packages after running:
$ dnf --releasever=31 --repoid=fedora --repoid=rpmfusion-free downgrade 'kodi*' 'mesa*' 'xorg*' --nogpgcheck
I already tried --releasever=32, but it did not help.

Downgrading packages even further did not help. Since the leaked memory is in the kernel, reassigning to that component.

Created attachment 1723908 [details]
kmemleak output
I installed and booted kernel-debug and ran the following to get the file:
$ echo scan=on > /sys/kernel/debug/kmemleak
# waited for about 20 minutes
$ cat /sys/kernel/debug/kmemleak > kmemleak
I'm not skilled in kernel debugging. If someone could guide me, I'll do my best to provide more detailed information.
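
A saved kmemleak dump like the one attached above can get long, so a rough first pass is to count objects and bytes per task name. The sketch below is only an illustration, not something from this report: the function name `kmemleak_summary` is made up, and it assumes the usual record layout in which an `unreferenced object 0x... (size N):` line is followed by a `comm "name", pid ...` line.

```sh
#!/bin/sh
# Hypothetical helper: summarize a saved kmemleak dump per task name.
# Assumed record layout (one record per leaked object):
#   unreferenced object 0x... (size N):
#     comm "name", pid P, jiffies J (age ...)
# Output columns (tab separated): task name, object count, total bytes.
kmemleak_summary() {
    awk '
        /^unreferenced object/ {
            # Field 5 looks like "4096):"; strip the trailing "):".
            size = $5
            sub(/\):$/, "", size)
            pending = size + 0
            next
        }
        /^[ \t]*comm "/ && pending > 0 {
            # The text between the first pair of double quotes is the task name.
            split($0, parts, "\"")
            bytes[parts[2]] += pending
            count[parts[2]]++
            pending = 0
        }
        END {
            for (name in bytes)
                printf "%s\t%d\t%d\n", name, count[name], bytes[name]
        }
    ' "$1" | sort -t "$(printf '\t')" -k3,3nr
}
```

With the file written by the `cat` command above, this would be called as `kmemleak_summary kmemleak`. It only groups what kmemleak already reported; the backtraces in the dump are still what identifies the allocating code path.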
I had Fedora 32 Workstation running on the below specs with no issues. I downloaded the Fedora 33 Workstation ISO and booted from that, with very slow, sluggish performance. I installed to the drive from the same live DVD, and after booting from the internal SSD the issue was the same: very slow, sluggish performance.
ASUS PRIME TRX40-PRO
CPU: AMD Threadripper™ 3960X
RAM: 32GB RA
GPU: GeForce GTX 1650 4GB

I've noticed quite odd behavior which may explain why it's not easily reproducible by someone else: my NAS is connected to an AV receiver which is connected to a TV. The memleaks seem to occur when the receiver and TV are in standby mode. When they're on, kmemleak stops reporting to `dmesg`. Let me see what happens when I unplug the HDMI cable.

So it seems that the problem occurs when there's no active display. Unplugging the HDMI cable also causes memleaks.

If you're already certain where the memleak is happening, ignore this. bcc-tools includes a memleak tool that's described in part:

memleak traces and matches memory allocation and deallocation requests, and collects call stacks for each allocation. memleak can then print a summary of which call stacks performed allocations that weren't subsequently freed.
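
memleak reports each outstanding call stack under a header of the form `<bytes> bytes in <count> allocations from stack`, so a saved report can be ranked by size to find the stacks worth reading first. The following is a minimal sketch under that assumption; the function name `memleak_top` and the report file name are hypothetical, and the script only post-processes text rather than running memleak itself.

```sh
#!/bin/sh
# Hypothetical helper: rank the stacks in a saved bcc memleak report.
# $1: report file, $2: number of stacks to show (default 5).
# Assumes header lines of the form:
#   <bytes> bytes in <count> allocations from stack
# Output columns (tab separated):
#   total bytes, allocations, total MiB, average bytes per allocation.
memleak_top() {
    awk '
        $2 == "bytes" && $3 == "in" && $5 == "allocations" && $4 > 0 {
            printf "%d\t%d\t%.1f\t%d\n", $1, $4, $1 / 1048576, $1 / $4
        }
    ' "$1" | sort -k1,1nr | head -n "${2:-5}"
}
```

For example, `memleak_top memleak.txt 3` would print the three largest outstanding stacks from a report saved as `memleak.txt`. The average column is plain arithmetic on the header line; it hints at the typical allocation size but says nothing about why the memory was not freed.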
This is what I got from the memleak tool (among other reports):
247463936 bytes in 118 allocations from stack
    __alloc_pages_nodemask+0x2bf [kernel]
    __alloc_pages_nodemask+0x2bf [kernel]
    ttm_alloc_new_pages.isra.0+0x9b [ttm]
    ttm_pool_populate.part.0+0x180 [ttm]
    ttm_populate_and_map_pages+0x1c5 [ttm]
    ttm_tt_populate.part.0+0x1e [ttm]
    ttm_tt_bind+0x48 [ttm]
    ttm_bo_handle_move_mem+0x5a9 [ttm]
    ttm_bo_validate+0x17c [ttm]
    ttm_bo_init_reserved+0x313 [ttm]
    amdgpu_bo_do_create+0x1a3 [amdgpu]
    amdgpu_bo_create+0x30 [amdgpu]
    amdgpu_gem_object_create+0x7b [amdgpu]
    amdgpu_gem_create_ioctl+0x93 [amdgpu]
    drm_ioctl_kernel+0x8c [drm]
    drm_ioctl+0x206 [drm]
    amdgpu_drm_ioctl+0x49 [amdgpu]
    ksys_ioctl+0x82 [kernel]
    __x64_sys_ioctl+0x16 [kernel]
    do_syscall_64+0x52 [kernel]
    entry_SYSCALL_64_after_hwframe+0x44 [kernel]
$ uname -r
5.8.18-300.fc33.x86_64+debug

Everything is working fine now, no leaks for about 2 weeks.
Packages:
kernel-5.9.10-200.fc33.x86_64
xorg-x11-server-Xorg-1.20.9-1.fc33.x86_64
xorg-x11-drv-amdgpu-19.1.0-5.fc33.x86_64
mesa-dri-drivers-20.2.3-1.fc33.x86_64
Still running the same kodi build.
I'm closing the bug because it's not quite reproducible and I'm happy that the problem is gone.

I've seen this issue on 3 Dell PowerEdge 740's. The kernel was around 5.9.15-200. I also see this in the journal logs:
DMAR: [Firmware Bug]: No firmware reserved region can cover this RMRR [0x000000006f8a0000-0x000000006f8a2fff], contact BIOS vendor for >
DMAR: [Firmware Bug]: Your BIOS is broken; bad RMRR [0x000000006f8a0000-0x000000006f8a2fff]
BIOS vendor: Dell Inc.; Ver: 2.8.2; Product Version:
DMAR: ATSR flags: 0x0
It's not an AMD card:
lspci | grep -i --color 'vga\|3d\|2d'
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
So could this be a different issue?

RobbieTheK: yes, since you have a Matrox adapter not AMD, you're definitely not seeing the same problem Daniel was seeing.