352281 – Kernel memory leak with lots of threads?

Bug 352281 - Kernel memory leak with lots of threads?

Summary: Kernel memory leak with lots of threads?

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	7
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	428950 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2007-10-25 13:41 UTC by Chris Adams
Modified:	2008-02-11 22:38 UTC (History)
CC List:	5 users (show)
Fixed In Version:	2.6.23.15-80.fc7
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-02-11 22:38:20 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Chris Adams 2007-10-25 13:41:12 UTC

I am running Fedora 7 with kernel-2.6.22.9-91.fc7.x86_64 on a dual Xeon 2.8GHz
(with hyperthreading) and 2G RAM.  The only thing running on this box is clamd
and clamav-milter.  There are several busy mail servers running sendmail that
connect to this milter to scan email for viruses, so clamav-milter often has
1000 threads (and they come and go quickly).  The CPU load is light (usually in
the 90% idle range).  The clamd and clamav-milter processes have a relatively
small resident size, although clamav-milter has a virtual size in the
multi-gigabyte range (I guess due to thread stacks?).

The kernel appears to be leaking RAM on this box at a fairly consistent rate. 
You can see the RAM use graph here:

http://status.hiwaay.net/cricket/grapher.cgi?target=/servers/net-snmp/hornet;view=RAM+use;ranges=d:w:m:y

Shutting down the clamd and clamav-milter processes has no effect.  The used RAM
of course goes down by ~100M, but that is equal to the resident size of the two
processes.  The only thing that will free up the memory is a reboot (that's
where the sawtooth jumps are on the graph).

Comment 1 Chuck Ebbert 2007-10-25 23:29:54 UTC

Did you try 2.6.23.1-10?

http://koji.fedoraproject.org/koji/buildinfo?buildID=21666

Comment 2 Chris Adams 2007-10-26 00:09:28 UTC

I have loaded that kernel.  I will watch it for a couple of days and let you
know how it looks.

Comment 3 Chris Adams 2007-10-29 17:36:54 UTC

Still appears to be leaking memory.

The only other thing "interesting" I can think of about this system is that
clamav-milter and clamd are using a tmpfs for storage (clamav-milter actually
writes incoming messages to a temp directory and signals clamd to scan the
files).  I have it mounted with "mode=1777,size=500m" and it usually doesn't
have much data in it (it has 640K right now).

Just to see if it made a difference, I stopped clamav-milter and clamd and
unmounted the tmpfs, and the system still showed over 300M RAM in use not
counting buffers/cache.  There are no other processes using any significant
amount of resident RAM.

Comment 4 Chuck Ebbert 2007-10-29 18:33:13 UTC

What does /proc/meminfo say as the count goes up?

Also, with SLUB there should be lots of debug info in /sys/slab/*

The file Documentation/vm/slub.txt from the kernel-doc package describes how to
use the SLUB debugging.

Comment 5 Chris Adams 2007-10-29 19:05:19 UTC

I figured I start with seeing what slabinfo reports on the running (no debug
enabled) kernel, and I get:

Unresolved alias :0000360

If I look in /sys/slab, kcopyd_job is a symlink to ../slab/:0000360, which does
not exist (all the other symlinks appear to point to things that exist).

Comment 6 Chuck Ebbert 2007-10-29 19:40:48 UTC

kcopyd is part of lvm, and used by dm_snapshot and dm_mirror.

Comment 7 Chris Adams 2007-10-29 19:47:49 UTC

For the moment, I hacked slabinfo.c to ignore kcopyd_job, just to see what I
got.  It looks like the :0000016 slab is the problem:

Name                   Objects Objsize    Space Slabs/Part/Cpu  O/S O %Fr %Ef Flg
:0000016              14699932      16   235.2M      57423/8/4  256 0   0   8 *

I enabled tracing on that slab for about a minute which made the server really
busy (since that's a busy slab I guess).  I'm not quite sure what to do with the
logged info.

Comment 8 Chuck Ebbert 2007-10-29 21:10:09 UTC

I think you need to try booting with slub_debug=U

It's hard to say how much that will slow down the system, but it shouldn't be
too bad.

Comment 9 Chris Adams 2007-10-30 13:32:47 UTC

Here is the info on kmalloc-16:

Slabcache: kmalloc-16            Aliases:  0 Order :  0 Objects: 69745

Sizes (bytes)     Slabs              Debug                Memory
------------------------------------------------------------------------
Object :      16  Total  :    1091   Sanity Checks : Off  Total: 4468736
SlabObj:      64  Full   :    1082   Redzoning     : Off  Used : 1115920
SlabSiz:    4096  Partial:       5   Poisoning     : Off  Loss : 3352816
Loss   :      48  CpuSlab:       4   Tracking      : On   Lalig: 3347760
Align  :       8  Objects:      64   Tracing       : Off  Lpadd:       0

kmalloc-16 has no kmem_cache operations

kmalloc-16: Kernel object allocation
-----------------------------------------------------------------------
      4 arch_acpi_processor_init_pdc+0x41/0x118 age=491088/491092/491100 pid=1
cpus=0
      4 arch_acpi_processor_init_pdc+0x95/0x118 age=491088/491092/491100 pid=1
cpus=0
    226 sys_init_module+0x13ac/0x168c age=479352/484890/489962 pid=516-1454 cpus=0-3
     13 mempool_create_node+0x53/0xca age=489299/490985/492039 pid=1-576 cpus=0-1
      1 __vmalloc_area_node+0x71/0x170 age=212438 pid=2074 cpus=0
     11 alloc_vfsmnt+0xbe/0xdb age=476442/483363/491794 pid=1-1495 cpus=0-3
      1 proc_symlink+0x51/0x8f age=492335 pid=0 cpus=0
    808 sysfs_new_dirent+0x29/0x105 age=456619/489711/492072 pid=1-2204 cpus=0-3
      3 copy_semundo+0x32/0x79 age=458940/460478/463526 pid=1972-2074 cpus=2-3
     11 key_alloc+0xef/0x34c age=455601/459193/464277 pid=1933-2230 cpus=0-3
     17 selinux_key_alloc+0x21/0x48 age=455601/463408/492343 pid=0-2230 cpus=0-3
    824 selinux_file_alloc_security+0x2a/0x53 age=0/303719/484179 pid=1-20430
cpus=0-3
      1 hashtab_create+0x4d/0x83 age=484339 pid=1 cpus=1
    159 perm_read+0x75/0xd1 age=484338/484344/484346 pid=1 cpus=1
      1 policydb_init+0xa7/0x12e age=484339 pid=1 cpus=1
     21 class_read+0xc6/0x233 age=484338/484343/484346 pid=1 cpus=1
      3 role_read+0x79/0x139 age=484345/484345/484345 pid=1 cpus=1
    981 type_read+0x89/0xe5 age=484333/484343/484345 pid=1 cpus=1
      1 user_read+0x80/0x121 age=484333 pid=1 cpus=1
      1 sens_read+0x28/0xfe age=484333 pid=1 cpus=1
      6 policydb_read+0x3e0/0xbef age=484239/484239/484239 pid=1 cpus=0
      1 policydb_read+0x645/0xbef age=484238 pid=1 cpus=0
     12 policydb_read+0x7ac/0xbef age=484238/484238/484238 pid=1 cpus=0
      6 policydb_read+0x89f/0xbef age=484238/484238/484238 pid=1 cpus=0
  18011 security_netlbl_sid_to_secattr+0x73/0x105 age=12/212076/474187
pid=1632-3338 cpus=0-3
    117 cond_read_list+0xec/0x201 age=484239/484260/484298 pid=1 cpus=0
     22 cond_read_bool+0xc4/0xde age=484333/484333/484333 pid=1 cpus=1
  48393 cond_insertf+0xdd/0x138 age=484239/484269/484298 pid=1 cpus=0
     21 kvasprintf+0x44/0x6b age=481130/491870/492408 pid=0-930 cpus=0
      3 acpi_ds_build_internal_package_obj+0x9b/0x11f age=492009/492012/492014
pid=1 cpus=1
      7 acpi_ds_build_internal_buffer_obj+0xd8/0x123 age=492006/492007/492008
pid=1 cpus=1
      2 reserve_range+0x25/0xe7 age=491746/491767/491789 pid=1 cpus=1
      5 pnpacpi_option_resource+0x172/0x401 age=491850/491850/491850 pid=1 cpus=1
      3 hub_probe+0x1ed/0x681 age=489634/489786/489928 pid=516-526 cpus=1-2
      3 usb_cache_string+0x5a/0x88 age=489641/489793/489936 pid=516-526 cpus=1-2
      1 usb_create_ep_files+0x43/0x2d9 age=489935 pid=516 cpus=1
      1 pci_acpi_scan_root+0x23/0xc8 age=491968 pid=1 cpus=1
      1 proto_register+0x111/0x1ec age=491640 pid=1 cpus=1
      2 genl_register_family+0xd4/0x14c age=491794/491798/491802 pid=1 cpus=1
      1 netlbl_domhsh_init+0x26/0xc9 age=491802 pid=1 cpus=1
     15 pnpacpi_add_device_handler+0x16d/0x296 age=491846/491851/491863 pid=1 cpus=1
      2 pnpacpi_add_device_handler+0x220/0x296 age=491853/491853/491854 pid=1 cpus=1
      1 pci_mmcfg_arch_init+0x26/0x103 age=492052 pid=1 cpus=1
      1 pci_mmcfg_e7520+0x5c/0xa8 age=492059 pid=1 cpus=1
      1 ext3_fill_super+0x71f/0x1472 [ext3] age=484663 pid=1 cpus=3
      2 run+0xe4/0x422 [raid1] age=479530/482338/485147 pid=645-1437 cpus=0,3
      6 linear_ctr+0x31/0xc3 [dm_mod] age=484996/485029/485063 pid=650 cpus=1-3
      6 dev_create+0xa0/0x1fa [dm_mod] age=484996/485029/485063 pid=650 cpus=1-3

kmalloc-16: Kernel object freeing
------------------------------------------------------------------------
  58147 <not-available> age=4295159737 pid=0 cpus=0
      4 free_sect_attrs+0x1f/0x31 age=479865/481223/485299 pid=636-1392 cpus=0-1
     14 slab_objects+0x241/0x268 age=38128/319152/407343 pid=2382-19662 cpus=0-3
     69 put_filp+0x1c/0x45 age=161882/464441/491164 pid=1-2254 cpus=0-3
  11304 __fput+0xd2/0x191 age=2/221287/491176 pid=1-20427 cpus=0-3
      1 free_vfsmnt+0xd/0x1d age=484620 pid=1 cpus=3
      1 bio_free_map_data+0xc/0x15 age=457491 pid=2175 cpus=2
      1 bio_uncopy_user+0xf4/0x10e age=457491 pid=2175 cpus=2
      5 remove_kevent+0x31/0x44 age=456796/460105/464967 pid=1900 cpus=1-3
      3 release_sysfs_dirent+0x38/0x95 age=479865/487357/491114 pid=1-1392 cpus=0
      2 key_cleanup+0x72/0xf2 age=383438/419532/455626 pid=15-17 cpus=0,2
      2 key_cleanup+0xde/0xf2 age=383438/400265/417093 pid=15-17 cpus=0,2
     10 selinux_bprm_free_security+0x10/0x1d age=54456/389987/483934
pid=657-18857 cpus=0-3
      1 sel_write_load+0x50e/0x5da age=484198 pid=1 cpus=0
      1 role_destroy+0x9/0x27 age=484359 pid=1 cpus=1
      1 policydb_read+0x125/0xbef age=484352 pid=1 cpus=1
      1 ksign_free_signature+0x35/0x44 age=491135 pid=1 cpus=0
    142 kobject_uevent_env+0x3cc/0x40f age=456632/488785/492086 pid=1-2204 cpus=0-3
      1 acpi_ds_create_operand+0x1d6/0x1f2 age=491860 pid=1 cpus=1
      4 acpi_evaluate_object+0x1c9/0x1e5 age=491102/491106/491113 pid=1 cpus=0
      4 acpi_ut_delete_internal_obj+0x151/0x162 age=491102/491110/491119 pid=1
cpus=0
      1 reserve_range+0x8f/0xe7 age=491764 pid=1 cpus=1
      1 device_del+0xbd/0x2ea age=490824 pid=1 cpus=0
     10 device_add+0x210/0x630 age=480597/487754/490824 pid=1-1058 cpus=0-2
      3 bus_add_driver+0xb1/0x19d age=487847/489734/491079 pid=1-596 cpus=0-2
      4 class_device_add+0x26e/0x3ca age=488592/488938/489289 pid=596 cpus=0,3
      3 usb_get_configuration+0xe86/0xee3 age=489654/489807/489950 pid=516-526
cpus=1-2
      1 get_modalias+0xb9/0xf2 age=492073 pid=1 cpus=1

kmalloc-16: No NUMA information available.


The allocation entry that is increasing most is security_netlbl_sid_to_secattr,
and the freeing entries that are increasing most are <not-available> and __fput.


I rebooted with SELinux disabled, and kmalloc-16 is now sitting as 1200 objects
and holding (and everything else looks to be steady as well).

Comment 10 Chuck Ebbert 2007-11-01 18:19:44 UTC

Was the above trace from kernel 2.6.23.1-10? I'm trying to track down the leak.

Comment 11 Chris Adams 2007-11-01 18:41:16 UTC

Yes, it was from 2.6.23-1.10.fc7.x86_64.

Comment 12 Christopher Brown 2008-01-16 03:02:13 UTC

Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

There hasn't been much activity on this bug for a while. Could you tell me if
you are still having problems with the latest kernel?

If the problem no longer exists then please close this bug or I'll do so in a
few days if there is no additional information lodged.

Comment 13 Chris Adams 2008-01-18 19:38:52 UTC

I have not tried it again; this is a production system so I can't just poke at
it.  I was waiting to see if someone thought the problem might be fixed before I
tried again; if something has changed that should fix it, let me know and I'll
try it.

Comment 14 Chuck Ebbert 2008-01-18 23:11:55 UTC

*** Bug 428950 has been marked as a duplicate of this bug. ***

Comment 15 Paul Moore 2008-01-19 03:36:51 UTC

I've just posted a quick (aka untested) patch which I believe should solve the 
problem, if you are able to test the patch please let me know if this solves 
your problem.  The patch can be found here:

 * http://marc.info/?l=linux-netdev&m=120071344429617&w=2

Comment 16 Christopher Brown 2008-01-21 15:46:46 UTC

Chris,

I can create a test kernel for you with just this additional patch applied if
you wish. I take it you are still running Fedora 7?

Comment 17 Chris Adams 2008-01-21 15:50:51 UTC

I'm building kernel-2.6.23.12-52 right now with this patch applied.  I'll have
to re-label the filesystem as well (since I've loaded updates and such since
setting SELinux to disabled), but I think I will be able to install this today.
 I'll let you know what I see.

Comment 18 Chris Adams 2008-01-21 17:54:05 UTC

It has been running about an hour now, and it is looking okay.

One question for Red Hat kernel people on this CC: I'm planning on switching
this box from Fedora 7 to RHEL 5; will this bug also affect the current RHEL 5
kernel (and if so, will this patch be applied there as well)?

Comment 19 Eric Paris 2008-01-21 18:52:59 UTC

selinux_netlbl_socket_setsid() in RHEL5 does not have this leak.  I looked
through all of the struct netlbl_lsm_secattr users in
security/selinux/ss/services.c and they all seem to init/destroy.

Comment 20 Chris Adams 2008-01-21 19:56:16 UTC

Good to know about RHEL - thanks for checking that.

Comment 21 Chuck Ebbert 2008-01-22 23:46:12 UTC

Fix is in 2.6.23.14-66.

Comment 22 Fedora Update System 2008-02-11 03:33:25 UTC

kernel-2.6.23.15-80.fc7 has been submitted as an update for Fedora 7

Comment 23 Fedora Update System 2008-02-11 22:38:12 UTC

kernel-2.6.23.15-80.fc7 has been pushed to the Fedora 7 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.