615505 – kernel memory leak 1GB in 5 seconds in radeon driver

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	xorg-x11-drv-ati
Sub Component:
Version:	13
Hardware:	All
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Dave Airlie
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-07-16 21:06 UTC by David Mansfield
Modified:	2010-12-08 01:09 UTC
CC List:	16 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2010-11-30 00:12:02 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
/proc/meminfo output right before rebooting (1.12 KB, text/plain) 2010-07-16 21:06 UTC, David Mansfield
/proc/meminfo output right after rebooting (1.12 KB, text/plain) 2010-07-16 21:08 UTC, David Mansfield
/proc/slabinfo before rebooting (9.54 KB, text/plain) 2010-07-16 21:08 UTC, David Mansfield
/proc/slabinfo after rebooting (9.54 KB, text/plain) 2010-07-16 21:09 UTC, David Mansfield
lsmod output before reboot (1.44 KB, text/plain) 2010-07-16 21:11 UTC, David Mansfield
lsmod output after reboot (1020 bytes, text/plain) 2010-07-16 21:11 UTC, David Mansfield
slabinfo (10.06 KB, text/plain) 2010-07-22 18:13 UTC, d_jiten
meminfo (1.14 KB, text/plain) 2010-07-22 18:13 UTC, d_jiten
lsmod (2.22 KB, text/plain) 2010-07-22 18:13 UTC, d_jiten
ps (7.46 KB, text/plain) 2010-07-22 18:14 UTC, d_jiten
fix apparent memory leak for r600 based radeon (293 bytes, patch) 2010-08-30 17:38 UTC, David Mansfield
contents of /proc/vmallocinfo related to ttm or drm (8.93 KB, text/plain) 2010-09-08 17:30 UTC, David Mansfield
output of /sys/kernel/debug/dri/0/radeon_vram_mm (53.20 KB, text/plain) 2010-09-26 11:13 UTC, Marc Dietrich
output of radeon_gtt_mm (164.29 KB, text/plain) 2010-09-26 11:13 UTC, Marc Dietrich
ttm_page_pool (215 bytes, text/plain) 2010-09-26 11:14 UTC, Marc Dietrich
diff from vanilla 2.6.35 against Ubuntu 2.6.35 kernel (only relevant part) (1.88 KB, patch) 2010-09-27 07:25 UTC, Sean Davis
proposed patch to remove a race condition. (2.95 KB, patch) 2010-09-29 03:45 UTC, Dave Airlie
alternate fix (5.98 KB, text/plain) 2010-09-30 23:01 UTC, Dave Airlie

Description David Mansfield 2010-07-16 21:06:39 UTC

Created attachment 432499 [details]
/proc/meminfo output right before rebooting

Description of problem:
After running the system for 22 days, approx 1.5GB of memory had slowly vanished.

Version-Release number of selected component (if applicable):
kernel-2.6.33.5-124.fc13.x86_64

seems to apply to all recent versions

How reproducible:
always

Steps to Reproduce:
1. boot and wait
2.
3.
  
Actual results:


Expected results:


Additional info:
Here is the best I can give you, unfortunately

after 22 days my system was using far more memory to carry the same process load than it had immediately after booting.  So I shut down X and killed every process one-by-one, manually unloaded all unloadable modules.  This still showed 1.5GB used, not counting buffers or cache. 

I captured: /proc/meminfo /proc/slabinfo 'lsmod' and 'ps'

after rebooting, I repeated the experiment, killing all processes etc and recaptured the files.

The first set of files is labeled "afterunload" because it was after unloading modules, the second set is "afterboot" indicating after the reboot.

This problem is real; I received the following error yesterday:

Jul 15 15:35:44 gandalf kernel: Xorg: page allocation failure. order:0, mode:0x2
Jul 15 15:35:44 gandalf kernel: Pid: 2266, comm: Xorg Not tainted 2.6.33.5-124.fc13.x86_64 #1

The system also becomes slow and unresponsive, and the GNOME memory meter shows full memory utilization.

This system is x86_64 with 4GB ram installed.

I'm happy to do anything to try to track this down, but I realize it may be impossible.

I shall attach the 8 documents now.

Comment 1 David Mansfield 2010-07-16 21:08:04 UTC

Created attachment 432500 [details]
/proc/meminfo output right after rebooting

Comment 2 David Mansfield 2010-07-16 21:08:56 UTC

Created attachment 432502 [details]
/proc/slabinfo before rebooting

Comment 3 David Mansfield 2010-07-16 21:09:25 UTC

Created attachment 432503 [details]
/proc/slabinfo after rebooting

Comment 4 David Mansfield 2010-07-16 21:11:16 UTC

Created attachment 432504 [details]
lsmod output before reboot

Comment 5 David Mansfield 2010-07-16 21:11:58 UTC

Created attachment 432505 [details]
lsmod output after reboot

Comment 6 Chuck Ebbert 2010-07-18 16:14:29 UTC

This is almost certainly happening because you're running with no swap space enabled.

Comment 7 David Mansfield 2010-07-19 16:14:16 UTC

i've added swap (swap was disabled because i was unable to install w/swap due to anaconda bugs)

i'll let you know in 25 days or so ;-)

Comment 8 d_jiten 2010-07-22 18:13:00 UTC

Created attachment 433766 [details]
slabinfo

Comment 9 d_jiten 2010-07-22 18:13:22 UTC

Created attachment 433767 [details]
meminfo

Comment 10 d_jiten 2010-07-22 18:13:44 UTC

Created attachment 433768 [details]
lsmod

Comment 11 d_jiten 2010-07-22 18:14:02 UTC

Created attachment 433769 [details]
ps

Comment 12 d_jiten 2010-07-22 18:15:40 UTC

I am having the same issue and have added the files that David mentions.

Comment 13 David Mansfield 2010-08-26 21:23:17 UTC

this memory leak has repeated.  i have been running a script from cron that tracks vm statistics and process statistics every minute (smem -w, free and ps -auwx)

unfortunately the memory leak of 900MB is "instantaneous", i.e. it occurs between two snapshots 1 minute apart.

unfortunately, the "ps" output before and after show nothing except that many processes were swapped out to make room for the 900MB "gulp"

i was using the computer at the time and was either:

- clicking around in "rhythmbox"
or
- clicking around in firefox on "images.google.com"

I can see some cache files in firefox directory during the minute the leak occurred, which seem to be some google javascript ajax type calls.

i now have set up an every second monitor which will alert me within a second when the kernel dynamic memory usage grows over 1GB so I can know which process is responsible.  
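
A monitor of this kind could be sketched roughly as follows. This is an illustration only, not the script used in this report. It assumes that "memory in use outside buffers and page cache" (MemTotal - MemFree - Buffers - Cached from /proc/meminfo) is an acceptable stand-in for the "kernel dynamic memory" row of smem -w; that figure also includes userspace memory, so it overestimates. The function names and the 1 GiB default threshold are hypothetical.

```python
import time

def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of field name -> integer value
    (kB for most fields, plain counts for a few such as HugePages_Total)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def noncache_used_kb(fields):
    """Rough estimate of memory in use, excluding buffers and page cache."""
    return (fields["MemTotal"] - fields["MemFree"]
            - fields["Buffers"] - fields["Cached"])

def watch(threshold_kb=1024 * 1024, interval=1.0, path="/proc/meminfo"):
    """Poll once per interval and print an alert while usage exceeds the threshold."""
    while True:
        with open(path) as f:
            used = noncache_used_kb(parse_meminfo(f.read()))
        if used > threshold_kb:
            print(f"ALERT: {used} kB in use outside buffers/cache")
        time.sleep(interval)
```

Calling watch() from a terminal started before the browser would report the moment the threshold is crossed; pairing the alert with a ps snapshot is left out to keep the sketch short.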

i will update this bug when i have more info.

Comment 14 David Mansfield 2010-08-27 15:08:23 UTC

this is now entirely reproducible on two different machines with radeon cards, and not reproducible on one machine with nvidia.  all running fedora 13 x86_64.

i have set the "component" to the xorg ati, but the bug is most likely in radeon.ko, which is in the "kernel" package, so I'll leave it to you guys to get this right.

both systems fully updated as of today:

kernel: 2.6.33.8-149.fc13.x86_64
xorg: xorg-x11-drv-ati-6.13.0-1.fc13.x86_64

1st machine, PCI:*(0:1:0:0) 1002:95c5:1028:9018 ATI Technologies Inc RV620 LE [Radeon HD 3450] rev 0, Mem @ 0xd0000000/268435456, 0xfdee0000/65536, I/O @ 0x0000de00/256, BIOS @ 0x????????/131072


2nd machine, PCI:*(0:1:5:0) 1002:9610:1028:02e2 ATI Technologies Inc Radeon HD 3200 Graphics rev 0, Mem @ 0xd0000000/268435456, 0xfeaf0000/65536, 0xfe900000/1048576, I/O @ 0x0000d000/256, BIOS @ 0x????????/131072

steps to reproduce.

1. reboot.
2. login to gnome
2a. run "while true; do smem -w; sleep 1; done" in a terminal, watch the "kernel dynamic memory" row, in the "noncache" column.
3. open firefox
4. go to "images.google.com"
5. search for "penico"  ;-)
6. using the scrollbar, scroll to the bottom: BOOM, there goes 200 megs; scroll to the top: BOOM, 200 megs. repeat as necessary

If you close firefox, or even  X, or unload modules or anything, you never get the memory back until you reboot
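
When the once-per-second readings from the steps above are logged rather than watched by eye, the sudden jumps can be picked out mechanically. The sketch below assumes the readings have already been collected into a list of integers (in kB); the function name and the threshold are hypothetical and not part of smem.

```python
def find_jumps(samples_kb, min_jump_kb):
    """Return (index, delta) for every sample that grew by at least
    min_jump_kb compared with the sample immediately before it."""
    jumps = []
    for i in range(1, len(samples_kb)):
        delta = samples_kb[i] - samples_kb[i - 1]
        if delta >= min_jump_kb:
            jumps.append((i, delta))
    return jumps

# Made-up readings for illustration: two jumps of roughly 200 MB each.
readings = [100_000, 110_000, 310_000, 300_000, 520_000]
print(find_jumps(readings, 150_000))
```

Each reported index can then be matched against the timestamp of the scroll action that preceded it.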

Comment 15 David Mansfield 2010-08-30 17:38:01 UTC

    i found a leak in r600_cs.c.  p->track is allocated at line 763 but never
    freed.  i'll attach a patch, but I have no idea if this could be causing my
    problem or not.  can someone help?

Comment 16 David Mansfield 2010-08-30 17:38:40 UTC

Created attachment 441998 [details]
fix apparent memory leak for r600 based radeon

Comment 17 Chuck Ebbert 2010-09-01 09:43:07 UTC

(In reply to comment #16)
> Created attachment 441998 [details]
> fix apparent memory leak for r600 based radeon

Can you build a kernel with your fix and try it?

Comment 18 Dave Airlie 2010-09-01 10:13:18 UTC

are you using nomodeset on the command line?

if so why?

the fix makes sense but running nomodeset doesn't for r600 hardware.

Comment 19 David Mansfield 2010-09-01 13:27:03 UTC

in re #17: i'm going to try the latest in updates-testing, 

kernel-2.6.34.6-47.fc13.x86_64

examining that code, the p->track is not leaking (the r600_cs.c is very much updated in this kernel) so if that's my problem, it'll be fixed just via that update.

as to #18: i don't have nomodeset on the command line, cmd line is:

ro root=/dev/mapper/VolGroup00-root_f13_fs rd_LVM_LV=VolGroup00/root_f13_fs rd_NO_LUKS LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=us rhgb quiet

I'll let you all know if the new kernel fixes it. keep your fingers crossed.

Comment 20 David Mansfield 2010-09-01 13:40:16 UTC

bad news. still happens in the latest kernel as per comment#19 above

i have some more info, though this is subjective:

it seems the problem is with the actual scrolling.  the page i have has a ton of images on it (images.google.com), and when I scroll the screen "tears" as it scrolls, and it is at this point that 300M or so of dynamic memory is consumed (permanently).

if there's any way to turn on debugging, or strace xorg, or compile the kernel with some flags or anything like that, I'm game.

Comment 21 David Mansfield 2010-09-01 18:47:38 UTC

this debian bug seems absolutely identical

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=591061

Comment 22 Dave Airlie 2010-09-01 23:57:44 UTC

Can you try scrolling with the keyboard as opposed to with the mouse?

does it react any different?

I suspect we are missing a cleanup on a signal error path in this case.

Comment 23 David Mansfield 2010-09-02 13:35:59 UTC

it still happens exactly the same way. there is still the very noticeable "tearing" happening in the images as they scroll and about 1gb memory in 5 seconds gone.

I was looking at the HTML source for the webpage, (images.google.com) and I noticed it's using some strange html, it's using a "canvas" and also img with a "data:blah" url (i.e. base64 encoded image in the url).

should I be looking at the r600_*.c or the radeon_*.c code (or both), e.g. which *_cs.c should I be looking at?  based on this hardware:

PCI:*(0:1:0:0) 1002:95c5:1028:9018 ATI Technologies Inc RV620 LE
[Radeon HD 3450] rev 0, Mem @ 0xd0000000/268435456, 0xfdee0000/65536, I/O @
0x0000de00/256, BIOS @ 0x????????/131072

Comment 24 Dave Airlie 2010-09-02 22:16:22 UTC

if you are using KMS you need to look at both, but you can ignore the legacy functions in r600_cs.c

Comment 25 David Mansfield 2010-09-03 13:20:09 UTC

since KMS keeps coming up, I just verified the bug does NOT occur if I boot with nomodeset.  hth

Comment 26 David Mansfield 2010-09-03 13:33:16 UTC

is there any way to get strace to show better ioctl tracing for the drm calls?

Comment 27 Dave Airlie 2010-09-06 22:19:19 UTC

ioctl tracing won't really help us here,

can you

mount -t debugfs none /sys/kernel/debug
and see what /sys/kernel/debug/dri/0/gem_objects contents is?

Comment 28 David Mansfield 2010-09-07 13:40:46 UTC

doesn't seem to show a leak.  before firefox is started (clean boot):

125 objects
103731200 object bytes
0 pinned
0 pin bytes
0 gtt bytes
0 gtt total

after firefox/x/drm has leaked 1gb and firefox is closed again:

196 objects
104062976 object bytes
0 pinned
0 pin bytes
0 gtt bytes
0 gtt total
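
The two snapshots above can be compared mechanically. The sketch below assumes the gem_objects file has the "<number> <label>" layout shown in this comment; the helper names are hypothetical.

```python
def parse_gem_objects(text):
    """Parse gem_objects text ("<number> <label>" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        value, _, label = line.strip().partition(" ")
        if value.isdigit() and label:
            stats[label] = int(value)
    return stats

def gem_delta(before, after):
    """Per-label difference between two parsed snapshots."""
    return {k: after[k] - before[k] for k in before if k in after}

BEFORE = """125 objects
103731200 object bytes
0 pinned
0 pin bytes
0 gtt bytes
0 gtt total"""

AFTER = """196 objects
104062976 object bytes
0 pinned
0 pin bytes
0 gtt bytes
0 gtt total"""

delta = gem_delta(parse_gem_objects(BEFORE), parse_gem_objects(AFTER))
print(delta["objects"], delta["object bytes"])  # prints: 71 331776
```

A growth of 331776 bytes (324 KiB) across 71 extra objects is far too small to account for roughly 1 GB, which matches the observation that GEM object accounting does not show the leak.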

Comment 29 Sean Davis 2010-09-08 01:59:22 UTC

I have the same problem with a radeon HD3200. Scrolling the uri mentioned in the debian bug report http://code.google.com/p/chromium/issues/detail?id=8991  or google images seems to be the best way to trigger it.

I have tested both 32 and 64-bit versions of Fedora 13. The memory leak is much bigger in the 64-bit version, but both versions seem to stop leaking memory after a certain limit is reached.

For 32-bit version leaking will stop here after noncached kernel dynamic memory reaches around 380M, for 64-bit version it will stop around 790M.

Also looking at /proc/vmallocinfo I see a lot of entries like this:

0xffffc9000032d000-0xffffc90000332000   20480 ttm_tt_create+0xfc/0x15b [ttm] pages=4 vmalloc N0=4

these entries only seem to appear after scrolling the mentioned web pages and do not seem to be reclaimed.
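
To see how much vmalloc space such entries add up to, the file can be totalled per calling function. The sketch below assumes every relevant line has the layout of the entry quoted above (address range, size in bytes, then caller+offset); the function name and the keyword filter are hypothetical.

```python
from collections import defaultdict

def vmalloc_bytes_by_caller(text, keywords=("ttm", "drm")):
    """Sum the size column of /proc/vmallocinfo per caller symbol,
    keeping only lines that mention one of the keywords."""
    totals = defaultdict(int)
    for line in text.splitlines():
        parts = line.split()
        # Expect: <start>-<end> <size> <caller+offset/len> ...
        if len(parts) < 3 or not parts[1].isdigit():
            continue
        if not any(k in line for k in keywords):
            continue
        caller = parts[2].split("+", 1)[0]  # drop the "+0xfc/0x15b" suffix
        totals[caller] += int(parts[1])
    return dict(totals)

SAMPLE = (
    "0xffffc9000032d000-0xffffc90000332000   20480 "
    "ttm_tt_create+0xfc/0x15b [ttm] pages=4 vmalloc N0=4\n"
)
print(vmalloc_bytes_by_caller(SAMPLE))  # prints: {'ttm_tt_create': 20480}
```

Running it on snapshots taken before and after scrolling would show whether the ttm_tt_create total keeps growing.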

Comment 30 David Mansfield 2010-09-08 17:29:31 UTC

I also see many of the ttm_tt_create lines, as well as other drm related lines.

I just leaked about 1.7GB on a different website - this one had a javascript animated horizontal scrolling "headline".

I'm about to attach a text file of the contents of:

cat /proc/vmallocinfo |egrep '(ttm|drm)'

Comment 31 David Mansfield 2010-09-08 17:30:12 UTC

Created attachment 446053 [details]
contents of /proc/vmallocinfo related to ttm or drm

Comment 32 David Mansfield 2010-09-23 13:24:58 UTC

for the record, the leak does not occur when viewing the  sites with google chrome, only with firefox.  not that it's a firefox bug, but it's just an FYI.  

this bug forces me to reboot every time i accidentally hit a "bad" webpage.  pretty disastrous

Comment 33 Eduardo Ivanec 2010-09-24 23:29:05 UTC

Hi, I'm the reporter at debian.org.

I'm also seeing a growing list of ttm/drm related entries in vmallocinfo after triggering this bug. You can find my vmallocinfo at three different times here:

http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=61;filename=vmallocinfo.txt;att=1;bug=591061

Regards.

Comment 34 Marc Dietrich 2010-09-26 11:13:05 UTC

Created attachment 449707 [details]
output of /sys/kernel/debug/dri/0/radeon_vram_mm

as requested on IRC, output of /sys/kernel/debug/dri/0/radeon_vram_mm, radeon_gtt_mm and ttm_page_pool

Comment 35 Marc Dietrich 2010-09-26 11:13:42 UTC

Created attachment 449708 [details]
output of radeon_gtt_mm

Comment 36 Marc Dietrich 2010-09-26 11:14:27 UTC

Created attachment 449709 [details]
ttm_page_pool

Comment 37 Sean Davis 2010-09-27 07:25:07 UTC

Created attachment 449837 [details]
diff from vanilla 2.6.35 against Ubuntu 2.6.35 kernel (only relevant part)

The memory leak does not appear to be present in Ubuntu Maverick. I did a diff of drivers/gpu/drm/radeon/* from the Ubuntu kernel against vanilla v2.6.35, and after filtering out irrelevant parts I ended up with the attached diff.

After applying the attached diff to vanilla 2.6.35 (it should also apply to newer versions) I have not seen any memory leak so far. I have no idea if this patch is correct or not, but maybe it can point the developers in the right direction.

Comment 38 Marc Dietrich 2010-09-27 09:25:58 UTC

yeah! this fixes it here also. I found the complete patch at 
https://patchwork.kernel.org/patch/95248/

Comment 39 Dave Airlie 2010-09-27 10:16:43 UTC

yeah it's something in the eviction code that is leaking, I worked out that much today before I ran into another small leak I wanted to fix first,

this patch changes the buffer allocation enough that the eviction path is probably not getting hit as much, hopefully tomorrow I can find the actual problem.

Comment 40 Dave Airlie 2010-09-28 07:05:50 UTC

http://git.kernel.org/?p=linux/kernel/git/airlied/drm-2.6.git;a=commitdiff_plain;h=0fbecd400dd0a82d465b3086f209681e8c54cb0f

anyone care to try a kernel with that fix on it?

without the workaround fix that is in the other kernel

Comment 41 Eduardo Ivanec 2010-09-28 08:48:50 UTC

Hi Dave, I just applied your patch onto linux-2.6.36-rc4 and unfortunately the leak still seems to be there - can anyone else confirm this? I'm seeing the exact same behaviour when scrolling in Firefox, etc.

I uploaded my latest vmallocinfo to the report on bugs.debian.org.

In case someone finds it useful I can share my debian kpkg.

Comment 42 Dave Airlie 2010-09-28 08:52:43 UTC

today i tried my drm-fixes and couldn't reproduce with the google link that I was reproducing with fine yesterday with an older kernel

I backed out the most likely patches but can't figure it out,

tomorrow I'll go back to yesterday's kernel and try again, hopefully the chromium page hasn't changed content or something.

Comment 43 Marc Dietrich 2010-09-28 09:22:24 UTC

I'm always running these testing kernels (drm-radeon-testing, drm-fixes) which contained the fix already and the problem is still there.

Comment 44 Eduardo Ivanec 2010-09-28 14:43:29 UTC

I just tried with a patched clean -rc5 just in case (I had previously applied some other patches to -rc4) and the problem is still there.

Comment 45 Dave Airlie 2010-09-29 03:45:17 UTC

Created attachment 450370 [details]
proposed patch to remove a race condition.

Can you try this patch please?

Comment 46 Eduardo Ivanec 2010-09-29 05:40:19 UTC

It seems fixed! Thank you very much for your effort.

I'm using -rc5 patched with both your latest attachment and the previous fix (comment #40). Let me know if you would like me to retest without the previous patch.

Comment 47 Marc Dietrich 2010-09-29 08:35:54 UTC

confirmed - thanks!

Comment 48 J Denson 2010-09-30 07:16:33 UTC

Just to confirm the latest patch is all that's needed. Applied it to the current fc13 kernel 2.6.34.7-56.fc13 and the problem's gone.

Thanks! Claimed back about half a gig on my machine.

Comment 49 Dave Airlie 2010-09-30 23:01:50 UTC

Created attachment 450906 [details]
alternate fix

can someone test this fix instead of the one I posted earlier?

Comment 50 Eduardo Ivanec 2010-10-01 03:19:11 UTC

Just tested attachment 450906 [details] and the memory leak is still gone, but I seem to be getting sporadic screen corruption. I'm falling back to the previous patch to compare.

Comment 51 Eduardo Ivanec 2010-10-01 03:36:12 UTC

False alarm, I guess - I'm getting the same corruption with both patches, I just hadn't noticed it before. Is anyone else getting this? I'm seeing it on gkrellm2 specifically, in case anyone else is using it. I'm guessing it could be a different bug though.

Anyway, the latest patch seems as good as the previous.

Comment 52 David Mansfield 2010-10-05 19:04:27 UTC

to the poster in comment #50 and #51, is it possible your recompiled modules aren't loading properly?  I experience corruption when KMS is not enabled, but also the memory leak does not occur without KMS.  can you verify the patched ttm.ko is loading?

dmesg | grep ttm

Comment 53 Eduardo Ivanec 2010-10-07 14:09:09 UTC

Hi David,

nox:/home/perseguidor# dmesg | grep -i ttm
[    4.375781] [TTM] Zone  kernel: Available graphics memory: 1965756 kiB.
[    4.375783] [TTM] Initializing pool allocator.

ttm is also shown on lsmod:

nox:/home/perseguidor# lsmod | grep -i ttm
ttm                    54479  1 radeon
drm                   191010  4 radeon,ttm,drm_kms_helper
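
The same check can be scripted. The sketch below parses lsmod output in the column layout shown above (name, size, reference count, comma-separated users); the function name is hypothetical. Note that it only shows that a ttm module is loaded and used by radeon, not that the loaded module is the patched build.

```python
def parse_lsmod(text):
    """Parse lsmod output into {module: {"size", "refs", "users"}}."""
    modules = {}
    for line in text.splitlines():
        parts = line.split()
        # Skips the "Module Size Used by" header and malformed lines.
        if len(parts) < 3 or not parts[1].isdigit() or not parts[2].isdigit():
            continue
        users = parts[3].split(",") if len(parts) > 3 else []
        modules[parts[0]] = {
            "size": int(parts[1]),
            "refs": int(parts[2]),
            "users": [u for u in users if u],
        }
    return modules

LSMOD = """ttm                    54479  1 radeon
drm                   191010  4 radeon,ttm,drm_kms_helper"""

mods = parse_lsmod(LSMOD)
print("radeon" in mods["ttm"]["users"])  # prints: True
```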

Corruption is still there, but is very sporadic and happens on very specific apps. Perhaps this belongs in another bug report?

Comment 54 David Mansfield 2010-10-13 18:43:51 UTC

i believe a fix for this has been incorporated into the vanilla source, see:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=1df6a2ebd75067aefbdf07482bf8e3d0584e04ee

is it possible to get a fix incorporated into a fedora-13 errata kernel?

Comment 55 David Mansfield 2010-11-29 19:07:06 UTC

this is still not fixed in F14 latest kernel.

Comment 56 Kyle McMartin 2010-11-30 00:11:43 UTC

Thanks for identifying the commit, made my life much easier. Committed to F-14 now, sorry for taking so long.

Comment 57 Fedora Update System 2010-12-03 15:37:56 UTC

kernel-2.6.35.9-64.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/kernel-2.6.35.9-64.fc14

Comment 58 François Cami 2010-12-04 10:26:04 UTC

Please post comments, and karma, about your experience with kernel-2.6.35.9-64.fc14 in bodhi (see link in comment 57).

Comment 59 Fedora Update System 2010-12-05 00:42:05 UTC

kernel-2.6.35.9-64.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.
