Bug 470410 - possible memory leak in DLM system / kernel relating to recovery of nodes
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assigned To: David Teigland
QA Contact: Red Hat Kernel QE team
 
Reported: 2008-11-06 20:39 EST by Steven Dake
Modified: 2016-04-26 10:29 EDT
CC: 3 users

Doc Type: Bug Fix
Last Closed: 2008-11-11 12:22:37 EST

Attachments: None
Description Steven Dake 2008-11-06 20:39:15 EST
Description of problem:
I am not 100% positive whether this is normal behavior, but I thought I'd report it so it can be investigated.

I ran revolver for approximately 25 recoveries and found that any long-lived nodes, i.e., nodes that are not killed during those 25 recoveries, appear to leak slab data.  I have noticed OOMs, but I am not certain whether this leak is what triggers the OOM killer or whether some other issue is at work.

The slab data is as follows:
LONG RUNNING SYSTEM:

 Active / Total Objects (% used)    : 172325 / 185352 (93.0%)
 Active / Total Slabs (% used)      : 7657 / 7666 (99.9%)
 Active / Total Caches (% used)     : 109 / 159 (68.6%)
 Active / Total Size (% used)       : 27936.55K / 29327.98K (95.3%)
 Minimum / Average / Maximum Object : 0.01K / 0.16K / 128.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 31075  31051  99%    0.03K    275      113      1100K size-32
 19169  19124  99%    0.13K    661       29      2644K dlm_lkb
 17208  16754  97%    0.05K    239       72       956K buffer_head
 16048  11512  71%    0.06K    272       59      1088K size-64
 15210  14957  98%    0.12K    507       30      2028K size-128
 11426  11354  99%    0.13K    394       29      1576K dentry_cache
 10582  10393  98%    0.33K    962       11      3848K inode_cache
  8655   8376  96%    0.25K    577       15      2308K size-256
  8596   8050  93%    0.27K    614       14      2456K radix_tree_node
  8119   7831  96%    0.16K    353       23      1412K gfs_glock
  6048   6032  99%    0.50K    756        8      3024K size-512
  4290   4204  97%    0.05K     55       78       220K sysfs_dir_cache
  2760   2682  97%    0.08K     60       46       240K vm_area_struct
  2456   2440  99%    0.48K    307        8      1228K ext3_inode_cache


SHORT RUNNING SYSTEM:
 Active / Total Objects (% used)    : 145791 / 154954 (94.1%)
 Active / Total Slabs (% used)      : 7237 / 7239 (100.0%)
 Active / Total Caches (% used)     : 110 / 159 (69.2%)
 Active / Total Size (% used)       : 26665.56K / 27771.18K (96.0%)
 Minimum / Average / Maximum Object : 0.01K / 0.18K / 128.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 15420  15117  98%    0.12K    514       30      2056K size-128
 15048  14911  99%    0.05K    209       72       836K buffer_head
 14690  14265  97%    0.03K    130      113       520K size-32
 13452  12750  94%    0.06K    228       59       912K size-64
 10904  10871  99%    0.13K    376       29      1504K dentry_cache
 10648  10563  99%    0.33K    968       11      3872K inode_cache
 10500  10310  98%    0.27K    750       14      3000K radix_tree_node
  8850   8486  95%    0.25K    590       15      2360K size-256
  8326   8072  96%    0.16K    362       23      1448K gfs_glock
  8323   8051  96%    0.13K    287       29      1148K dlm_lkb
  6040   6026  99%    0.50K    755        8      3020K size-512
  4290   4185  97%    0.05K     55       78       220K sysfs_dir_cache
  3450   2711  78%    0.08K     75       46       300K vm_area_struct
  2456   2453  99%    0.48K    307        8      1228K ext3_inode_cache
  1794   1652  92%    0.08K     39       46       156K gfs_bufdata

As can be seen, the size-32 and dlm_lkb caches appear to leak on the longer-running system.
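
To make that comparison concrete, the suspect caches can be snapshotted on each node and diffed.  This is only a sketch: the hostnames are illustrative, and it assumes root access to /proc/slabinfo on both machines.

  # Snapshot the suspect caches on each node, then compare the counts.
  ssh root@node1 "grep -E '^(dlm_lkb|size-32) ' /proc/slabinfo" > node1.slab
  ssh root@node2 "grep -E '^(dlm_lkb|size-32) ' /proc/slabinfo" > node2.slab
  diff node1.slab node2.slab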

Version-Release number of selected component (if applicable):

RHEL 5.3 current dev tree

How reproducible:

Highly reproducible.

Steps to Reproduce:
1. Set up a 3-node cluster plus revolver running a plock load with -t 1 (1-minute timeout).
2. Let it cook for 1-2 hrs and watch slabtop on node 2; for some reason node 2 can be long lived (a logging alternative to slabtop is sketched after this list).
3. If node 2 is killed you will have to restart slabtop and wait for the issue to reproduce.  revolver tends to want to kill nodes 1 and 3 more often than node 2, so node 2 is a good node to watch for leaks.  Another alternative is to force revolver never to kill a specific node, but I don't know how to do this.
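
As a monitoring sketch for step 2, a loop like the one below logs timestamped counts for the two suspect caches to a file, so the history survives a slabtop restart.  Assumptions: root access and the RHEL 5 /proc/slabinfo layout (name, active_objs, num_objs, ...); the 60-second interval and log path are arbitrary choices, not from the original report.

  #!/bin/bash
  # Log timestamped object counts for the suspect caches on the
  # long-lived node; the log survives if slabtop has to be restarted.
  LOG=/var/tmp/slab-watch.log   # arbitrary path
  while true; do
      awk -v ts="$(date '+%F %T')" \
          '/^(dlm_lkb|size-32) / { print ts, $1, "active=" $2, "total=" $3 }' \
          /proc/slabinfo >> "$LOG"
      sleep 60
  done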
  
Actual results:
The system OOMs; I am not certain whether this leak is the cause.  The actual results are shown above.

Expected results:
I would expect the slab data to be released after a recovery operation, not to increase each time a node is recovered.

Additional info:
Comment 1 David Teigland 2008-11-07 10:10:55 EST
This looks normal.  Also, none of the caches are using very much memory;
the biggest is just 4M for the inode_cache.
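
For reference, a rough per-cache memory figure can be computed straight from /proc/slabinfo (num_objs * objsize; an approximation that ignores per-slab overhead, and the column positions assume the 2.x slabinfo format):

  # Top caches by approximate memory use, in KB.
  awk 'NR > 2 { printf "%8.0f KB  %s\n", $3 * $4 / 1024, $1 }' /proc/slabinfo | sort -rn | head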
