Bug 1196062 - Kernel OOM killer is terminating ovirt engine - Out of memory: Kill process 16014 (java) score 245 or sacrifice child
Keywords:
Status: CLOSED DUPLICATE of bug 1190466
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine-setup
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.3
Assignee: Sandro Bonazzola
QA Contact: Jiri Belka
URL:
Whiteboard: integration
Depends On:
Blocks:
 
Reported: 2015-02-25 08:55 UTC by Raz Tamir
Modified: 2015-03-31 07:19 UTC
CC: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-03-31 07:19:14 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:
Flags: ratamir: Triaged?


Attachments
engine log (8.69 MB, text/plain), 2015-02-25 08:55 UTC, Raz Tamir
messages (450.88 KB, text/plain), 2015-02-25 08:59 UTC, Raz Tamir

Description Raz Tamir 2015-02-25 08:55:27 UTC
Created attachment 995030 [details]
engine log

Description of problem:
After a few hours of running automated tests, the engine service suddenly died and we started getting error code 503.
Nothing appears in engine.log, but /var/log/messages shows the following output:
Feb 25 07:49:53 jenkins-vm-33 kernel: Out of memory: Kill process 16014 (java) score 245 or sacrifice child
Feb 25 07:49:53 jenkins-vm-33 kernel: Killed process 16014, UID 108, (java) total-vm:2908156kB, anon-rss:971900kB, file-rss:880kB
Feb 25 07:50:01 jenkins-vm-33 2015-02-25 07:50:00,890 ovirt-engine: ERROR run:532 Error: process terminated with status code -9
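
For reference, OOM kills like this can be confirmed straight from the kernel log (a minimal sketch using standard tools; the log path is the EL6 default):

# grep -i 'out of memory' /var/log/messages
# dmesg | grep -iE 'oom|killed process'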


Version-Release number of selected component (if applicable):
el6.6
vt13.11

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Raz Tamir 2015-02-25 08:59:27 UTC
Created attachment 995031 [details]
messages

Comment 2 Gil Klein 2015-02-25 09:22:06 UTC
From the dmesg output:

Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:  61
CPU    1: hi:  186, btch:  31 usd:  59
Node 0 Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd: 172
CPU    1: hi:  186, btch:  31 usd: 161
active_anon:723263 inactive_anon:184266 isolated_anon:0
 active_file:1485 inactive_file:2271 isolated_file:0
 unevictable:6116 dirty:78 writeback:6 unstable:0
 free:21254 slab_reclaimable:4135 slab_unreclaimable:16500
 mapped:3515 shmem:2100 pagetables:7636 bounce:0
Node 0 DMA free:15724kB min:248kB low:308kB high:372kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15320kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 3512 4017 4017
Node 0 DMA32 free:60896kB min:58868kB low:73584kB high:88300kB active_anon:2683416kB inactive_anon:527328kB active_file:4088kB inactive_file:6980kB unevictable:22272kB isolated(anon):0kB isolated(file):0kB present:3596496kB mlocked:5920kB dirty:212kB writeback:24kB mapped:8828kB shmem:6092kB slab_reclaimable:7624kB slab_unreclaimable:40096kB kernel_stack:2232kB pagetables:18812kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:23360 all_unreclaimable? yes
lowmem_reserve[]: 0 0 505 505
Node 0 Normal free:8396kB min:8464kB low:10580kB high:12696kB active_anon:209636kB inactive_anon:209736kB active_file:1852kB inactive_file:2104kB unevictable:2192kB isolated(anon):0kB isolated(file):0kB present:517120kB mlocked:2192kB dirty:100kB writeback:0kB mapped:5232kB shmem:2308kB slab_reclaimable:8916kB slab_unreclaimable:25904kB kernel_stack:1216kB pagetables:11732kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:10540 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 3*4kB 2*8kB 1*16kB 2*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15724kB
Node 0 DMA32: 1414*4kB 1065*8kB 632*16kB 284*32kB 106*64kB 42*128kB 24*256kB 12*512kB 1*1024kB 1*2048kB 0*4096kB = 60896kB
Node 0 Normal: 437*4kB 189*8kB 47*16kB 35*32kB 5*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 8396kB
10897 total pagecache pages
4396 pages in swap cache
Swap cache stats: add 1858415, delete 1854019, find 1525435/1649361
Free swap  = 0kB
Total swap = 2097148kB
1048575 pages RAM
68326 pages reserved
34277 pages shared
950558 pages non-shared
[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  399]     0   399     2849       64   0     -17         -1000 udevd
[  945]     0   945     2280       93   1       0             0 dhclient
[  997]     0   997    62272      218   0       0             0 rsyslogd
[ 1027]     0  1027     2707      122   0       0             0 irqbalance
[ 1048]    32  1048     4744      134   0       0             0 rpcbind
[ 1068]    29  1068     5837      164   0       0             0 rpc.statd
[ 1199]    81  1199    24343      149   0       0             0 dbus-daemon
[ 1244]     0  1244     1020      124   1       0             0 acpid
[ 1254]    68  1254     9441      265   0       0             0 hald
[ 1255]     0  1255     5099      132   0       0             0 hald-runner
[ 1287]     0  1287     5629      119   0       0             0 hald-addon-inpu
[ 1298]    68  1298     4501      146   1       0             0 hald-addon-acpi
[ 1318]     0  1318    96533      184   0       0             0 automount
[ 1341]     0  1341    16554       86   0     -17         -1000 sshd
[ 1387]   497  1387    10366      155   0       0             0 nrpe
[ 1464]     0  1464    20214      222   0       0             0 master
[ 1473]    89  1473    20278      207   0       0             0 qmgr
[ 1489]     0  1489    28660      141   0       0             0 abrtd
[ 1499]     0  1499    29214      163   0       0             0 crond
[ 1512]     0  1512     5276       69   0       0             0 atd
[ 1521]     0  1521    27083      109   0       0             0 rhsmcertd
[ 1540]     0  1540    45806     8479   0       0             0 puppet
[ 1561]     0  1561     1016      122   0       0             0 mingetty
[ 1563]     0  1563     1016      122   0       0             0 mingetty
[ 1565]     0  1565     1016      122   1       0             0 mingetty
[ 1567]     0  1567     1016      122   0       0             0 mingetty
[ 1569]     0  1569     1016      122   0       0             0 mingetty
[ 1571]     0  1571     1016      122   1       0             0 mingetty
[ 1580]     0  1580     3079       98   0     -17         -1000 udevd
[ 1916]     0  1916    27267      250   1       0             0 sshd
[ 2005]     0  2005     2848       46   0     -17         -1000 udevd
[ 2025]     0  2025    26514      152   0       0             0 bash
[ 2047]     0  2047   635456    11247   0       0             0 java
[ 2075]     0  2075     7237     5699   0       0           -17 iscsiuio
[ 2081]     0  2081     1232      114   1       0             0 iscsid
[ 2082]     0  2082     1358      845   1       0           -17 iscsid
[ 3706]    26  3706    54311      345   0     -17         -1000 postmaster
[ 3708]    26  3708    44722       91   0     -17         -1000 postmaster
[ 3710]    26  3710    54376      531   0     -17         -1000 postmaster
[ 3711]    26  3711    54311      122   1     -17         -1000 postmaster
[ 3712]    26  3712    54411      206   0     -17         -1000 postmaster
[ 3713]    26  3713    44792      159   0     -17         -1000 postmaster
[14382]     0 14382    26300       30   1       0             0 rpc.rquotad
[14387]     0 14387     5417       80   1       0             0 rpc.mountd
[14429]     0 14429     6291       74   0       0             0 rpc.idmapd
[14472]   108 14472    55859      579   0       0             0 ovirt-fence-kdu
[14475]    26 14475    55387      649   0     -17         -1000 postmaster
[14560]     0 14560    50195      309   0       0             0 httpd
[14566]    48 14566    50395      434   0       0             0 httpd
[14567]    48 14567    50300      637   0       0             0 httpd
[14568]    48 14568    50331      680   1       0             0 httpd
[14569]    48 14569    50294      393   0       0             0 httpd
[14570]    48 14570    50300      618   1       0             0 httpd
[14571]    48 14571    50287      643   1       0             0 httpd
[14572]    48 14572    50294      630   1       0             0 httpd
[14573]    48 14573    50294      628   1       0             0 httpd
[14603]   108 14603    67024      298   0       0             0 ovirt-websocket
[16003]   108 16003    47844       82   1       0             0 ovirt-engine.py
[16014]   108 16014   724983   266534   1       0             0 java
[16107]    26 16107    83475    27670   1     -17         -1000 postmaster
[16228]    26 16228    84160    28633   1     -17         -1000 postmaster
[16451]    26 16451    83077    27161   1     -17         -1000 postmaster
[16890]    26 16890    80665    25667   1     -17         -1000 postmaster
[17116]    26 17116    84503    28078   1     -17         -1000 postmaster
[18546]    26 18546    81736    27379   1     -17         -1000 postmaster
[18547]    26 18547    82215    27142   1     -17         -1000 postmaster
[18548]    26 18548    82869    27290   1     -17         -1000 postmaster
[18666]    48 18666    50311      659   1       0             0 httpd
[27424]    48 27424    50307      644   1       0             0 httpd
[  668]    26   668    83157    28224   1     -17         -1000 postmaster
[  669]    26   669    80970    25352   1     -17         -1000 postmaster
[  670]    26   670    82476    27899   1     -17         -1000 postmaster
[21828]    48 21828    50307      633   0       0             0 httpd
[21829]    48 21829    50307      407   1       0             0 httpd
[21830]    48 21830    50309      412   0       0             0 httpd
[21834]    48 21834    50307      417   0       0             0 httpd
[22088]    48 22088    50310      406   0       0             0 httpd
[22089]    48 22089    50310      411   0       0             0 httpd
[22090]    48 22090    50294      394   0       0             0 httpd
[23268]    48 23268    50310      641   0       0             0 httpd
[23269]    48 23269    50311      672   0       0             0 httpd
[23270]    48 23270    50307      647   1       0             0 httpd
[ 3491]     0  3491    26526       13   0       0             0 sh
[ 3517]     0  3517    90295      107   0       0             0 python
[ 3619]     0  3619   678353   302219   0       0             0 python
[ 4374]     0  4374    24993       76   0       0             0 sshd
[22611]    26 22611    75483    19908   1     -17         -1000 postmaster
[  379]    38   379     7683       45   0       0             0 ntpd
[10769]    89 10769    20234       29   0       0             0 pickup
[16765]     0 16765    25235       43   1       0             0 tail
[17142]   497 17142    10884      411   0       0             0 nrpe
[17144]   497 17144    10884      411   0       0             0 nrpe
[17146]   497 17146    10884      411   1       0             0 nrpe
[17147]   497 17147    10885      143   1       0             0 nrpe
[17148]   497 17148    10885      143   0       0             0 nrpe
[17149]   497 17149    33299     1277   0       0             0 check_linux_sta
[17150]   497 17150    33299     1298   0       0             0 check_linux_sta
[17151]   497 17151    10885      143   1       0             0 nrpe
[17152]   497 17152    33299     1269   0       0             0 check_linux_sta
Out of memory: Kill process 3619 (python) score 307 or sacrifice child
Killed process 3619, UID 0, (python) total-vm:2713412kB, anon-rss:1208320kB, file-rss:556kB
SELinux: initialized (dev 0:17, type nfs), uses genfs_contexts
SELinux: initialized (dev 0:18, type nfs), uses genfs_contexts
SELinux: initialized (dev 0:17, type nfs), uses genfs_contexts
SELinux: initialized (dev 0:18, type nfs), uses genfs_contexts
SELinux: initialized (dev 0:17, type nfs), uses genfs_contexts
SELinux: initialized (dev 0:18, type nfs), uses genfs_contexts
SELinux: initialized (dev 0:17, type nfs), uses genfs_contexts
SELinux: initialized (dev 0:18, type nfs), uses genfs_contexts
puppet invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
puppet cpuset=/ mems_allowed=0
Pid: 2312, comm: puppet Not tainted 2.6.32-504.8.1.el6.x86_64 #1
Call Trace:
 [<ffffffff810d40c1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
 [<ffffffff81127300>] ? dump_header+0x90/0x1b0
 [<ffffffff810d5161>] ? cpuset_mems_allowed_intersects+0x21/0x30
 [<ffffffff81127782>] ? oom_kill_process+0x82/0x2a0
 [<ffffffff811276c1>] ? select_bad_process+0xe1/0x120
 [<ffffffff81127bc0>] ? out_of_memory+0x220/0x3c0
 [<ffffffff811344df>] ? __alloc_pages_nodemask+0x89f/0x8d0
 [<ffffffff8116c69a>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff811246f7>] ? __page_cache_alloc+0x87/0x90
 [<ffffffff811240de>] ? find_get_page+0x1e/0xa0
 [<ffffffff81125697>] ? filemap_fault+0x1a7/0x500
 [<ffffffff8114eae4>] ? __do_fault+0x54/0x530
 [<ffffffff810b33c6>] ? futex_wait+0x1e6/0x310
 [<ffffffff8114f0b7>] ? handle_pte_fault+0xf7/0xb00
 [<ffffffff810a2dd0>] ? hrtimer_wakeup+0x0/0x30
 [<ffffffff810a3da4>] ? hrtimer_start_range_ns+0x14/0x20
 [<ffffffff8114fcea>] ? handle_mm_fault+0x22a/0x300
 [<ffffffff81063bf3>] ? perf_event_task_sched_out+0x33/0x70
 [<ffffffff8104d0d8>] ? __do_page_fault+0x138/0x480
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff81041e98>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8152ffde>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8152d395>] ? page_fault+0x25/0x30
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
CPU    1: hi:  186, btch:  31 usd:  30
Node 0 Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
CPU    1: hi:  186, btch:  31 usd:  30
active_anon:801151 inactive_anon:109094 isolated_anon:0
 active_file:34 inactive_file:58 isolated_file:0
 unevictable:6116 dirty:0 writeback:0 unstable:0
 free:21264 slab_reclaimable:4154 slab_unreclaimable:16711
 mapped:2203 shmem:3091 pagetables:8270 bounce:0
Node 0 DMA free:15724kB min:248kB low:308kB high:372kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15320kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 3512 4017 4017
Node 0 DMA32 free:60872kB min:58868kB low:73584kB high:88300kB active_anon:2991752kB inactive_anon:223432kB active_file:116kB inactive_file:208kB unevictable:22272kB isolated(anon):0kB isolated(file):0kB present:3596496kB mlocked:5920kB dirty:0kB writeback:0kB mapped:2972kB shmem:7960kB slab_reclaimable:7764kB slab_unreclaimable:41044kB kernel_stack:2368kB pagetables:22016kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:960 all_unreclaimable? no
lowmem_reserve[]: 0 0 505 505
Node 0 Normal free:8460kB min:8464kB low:10580kB high:12696kB active_anon:212852kB inactive_anon:212944kB active_file:20kB inactive_file:0kB unevictable:2192kB isolated(anon):0kB isolated(file):0kB present:517120kB mlocked:2192kB dirty:0kB writeback:0kB mapped:5840kB shmem:4404kB slab_reclaimable:8852kB slab_unreclaimable:25800kB kernel_stack:1216kB pagetables:11064kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:143 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 3*4kB 2*8kB 1*16kB 2*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15724kB
Node 0 DMA32: 1290*4kB 996*8kB 632*16kB 246*32kB 106*64kB 44*128kB 24*256kB 12*512kB 3*1024kB 1*2048kB 0*4096kB = 60936kB
Node 0 Normal: 261*4kB 179*8kB 70*16kB 22*32kB 7*64kB 3*128kB 3*256kB 1*512kB 2*1024kB 0*2048kB 0*4096kB = 8460kB
13679 total pagecache pages
9835 pages in swap cache
Swap cache stats: add 5819923, delete 5810088, find 5522866/6015792
Free swap  = 0kB
Total swap = 2097148kB
1048575 pages RAM
68326 pages reserved
37837 pages shared
952554 pages non-shared
[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  399]     0   399     2849       64   0     -17         -1000 udevd
[  945]     0   945     2280       88   1       0             0 dhclient
[  997]     0   997    62272      221   0       0             0 rsyslogd
[ 1027]     0  1027     2707      115   0       0             0 irqbalance
[ 1048]    32  1048     4744      134   0       0             0 rpcbind
[ 1068]    29  1068     5837      165   0       0             0 rpc.statd
[ 1199]    81  1199    24343      148   0       0             0 dbus-daemon
[ 1244]     0  1244     1020      124   1       0             0 acpid
[ 1254]    68  1254     9441      264   0       0             0 hald
[ 1255]     0  1255     5099      131   0       0             0 hald-runner
[ 1287]     0  1287     5629      119   0       0             0 hald-addon-inpu
[ 1298]    68  1298     4501      146   1       0             0 hald-addon-acpi
[ 1318]     0  1318    96533      184   0       0             0 automount
[ 1341]     0  1341    16554      113   1     -17         -1000 sshd
[ 1387]   497  1387    10366      153   0       0             0 nrpe
[ 1464]     0  1464    20214      221   0       0             0 master
[ 1473]    89  1473    20278      206   0       0             0 qmgr
[ 1489]     0  1489    28660      140   0       0             0 abrtd
[ 1499]     0  1499    29214      162   0       0             0 crond
[ 1512]     0  1512     5276       68   0       0             0 atd
[ 1521]     0  1521    27083      107   0       0             0 rhsmcertd
[ 1540]     0  1540    45806     2960   0       0             0 puppet
[ 1561]     0  1561     1016      121   0       0             0 mingetty
[ 1563]     0  1563     1016      121   0       0             0 mingetty
[ 1565]     0  1565     1016      121   1       0             0 mingetty
[ 1567]     0  1567     1016      121   0       0             0 mingetty
[ 1569]     0  1569     1016      121   0       0             0 mingetty
[ 1571]     0  1571     1016      121   1       0             0 mingetty
[ 1580]     0  1580     3079       98   1     -17         -1000 udevd
[ 1916]     0  1916    27267      247   1       0             0 sshd
[ 2005]     0  2005     2848       46   0     -17         -1000 udevd
[ 2025]     0  2025    26514      151   0       0             0 bash
[ 2047]     0  2047   635456     7158   0       0             0 java
[ 2075]     0  2075     7237     5699   0       0           -17 iscsiuio
[ 2081]     0  2081     1232      114   0       0             0 iscsid
[ 2082]     0  2082     1358      845   0       0           -17 iscsid
[ 3706]    26  3706    54311      310   0     -17         -1000 postmaster
[ 3708]    26  3708    44722       88   0     -17         -1000 postmaster
[ 3710]    26  3710    54376      892   0     -17         -1000 postmaster
[ 3711]    26  3711    54311      114   1     -17         -1000 postmaster
[ 3712]    26  3712    54409      190   1     -17         -1000 postmaster
[ 3713]    26  3713    44792      149   1     -17         -1000 postmaster
[14382]     0 14382    26300       30   1       0             0 rpc.rquotad
[14387]     0 14387     5417       80   1       0             0 rpc.mountd
[14429]     0 14429     6291       73   1       0             0 rpc.idmapd
[14472]   108 14472    55859      430   0       0             0 ovirt-fence-kdu
[14475]    26 14475    55387      439   0     -17         -1000 postmaster
[14560]     0 14560    50195      254   0       0             0 httpd
[14566]    48 14566    50395      424   0       0             0 httpd
[14567]    48 14567    50300      364   1       0             0 httpd
[14568]    48 14568    50331      407   0       0             0 httpd
[14569]    48 14569    50294      388   0       0             0 httpd
[14570]    48 14570    50286      389   1       0             0 httpd
[14571]    48 14571    50287      392   1       0             0 httpd
[14572]    48 14572    50294      412   1       0             0 httpd
[14573]    48 14573    50294      389   0       0             0 httpd
[14603]   108 14603    67024      226   0       0             0 ovirt-websocket
[16003]   108 16003    47844       80   1       0             0 ovirt-engine.py
[16014]   108 16014   727039   243195   1       0             0 java
[16107]    26 16107    85429    29032   0     -17         -1000 postmaster
[16228]    26 16228    84776    27820   0     -17         -1000 postmaster
[16451]    26 16451    84613    25970   1     -17         -1000 postmaster
[16890]    26 16890    84369    27955   1     -17         -1000 postmaster
[17116]    26 17116    85549    28879   1     -17         -1000 postmaster
[18546]    26 18546    85363    27447   1     -17         -1000 postmaster
[18547]    26 18547    83189    27153   1     -17         -1000 postmaster
[18548]    26 18548    81105    23263   0     -17         -1000 postmaster
[18666]    48 18666    50311      398   0       0             0 httpd
[27424]    48 27424    50307      403   0       0             0 httpd
[  668]    26   668    83509    24314   0     -17         -1000 postmaster
[  669]    26   669    81226    25677   0     -17         -1000 postmaster
[  670]    26   670    83970    26284   0     -17         -1000 postmaster
[21828]    48 21828    50307      429   0       0             0 httpd
[21829]    48 21829    50307      365   0       0             0 httpd
[21830]    48 21830    50309      363   1       0             0 httpd
[21834]    48 21834    50307      429   0       0             0 httpd
[22088]    48 22088    50310      432   0       0             0 httpd
[22089]    48 22089    50310      410   0       0             0 httpd
[22090]    48 22090    50294      402   0       0             0 httpd
[23268]    48 23268    50310      401   0       0             0 httpd
[23269]    48 23269    50311      402   0       0             0 httpd
[23270]    48 23270    50307      427   1       0             0 httpd
[22611]    26 22611    81449    24325   1     -17         -1000 postmaster
[16765]     0 16765    25236       12   1       0             0 tail
[ 9388]     0  9388    25236       13   1       0             0 tail
[15425]     0 15425    25236       13   0       0             0 tail
[14722]    26 14722    80939    24366   1     -17         -1000 postmaster
[14723]    26 14723    79952    23725   0     -17         -1000 postmaster
[14724]    26 14724    80298    20724   1     -17         -1000 postmaster
[14725]    26 14725    79149    23090   0     -17         -1000 postmaster
[14726]    26 14726    82188    25619   1     -17         -1000 postmaster
[20504]     0 20504    26526        1   0       0             0 sh
[20531]     0 20531    90296        1   0       0             0 python
[20598]     0 20598   469821   174785   1       0             0 python
[21440]     0 21440    24993       62   0       0             0 sshd
[ 3119]    38  3119     7683       32   0       0             0 ntpd
[12810]    26 12810    73247    17433   0     -17         -1000 postmaster
[12811]    26 12811    73708    18315   1     -17         -1000 postmaster
[21721]    89 21721    20234       17   0       0             0 pickup
[23941]     0 23941    25235       22   1       0             0 tail
[24301]     0 24301    24994      245   1       0             0 sshd
[24311]     0 24311    27084       84   0       0             0 bash
[24388]    26 24388    56410     1940   0     -17         -1000 postmaster
[24389]    26 24389    57066     2694   1     -17         -1000 postmaster
[24390]    26 24390    60104     5576   1     -17         -1000 postmaster
[24391]    26 24391    56361     1936   0     -17         -1000 postmaster
[24392]    26 24392    57351     3156   1     -17         -1000 postmaster
[24393]    26 24393    56962     2642   0     -17         -1000 postmaster
Out of memory: Kill process 16014 (java) score 245 or sacrifice child
Killed process 16014, UID 108, (java) total-vm:2908156kB, anon-rss:971900kB, file-rss:880kB

Comment 3 Oved Ourfali 2015-02-25 09:39:49 UTC
What version was it tested with?

Comment 4 Raz Tamir 2015-02-25 09:43:15 UTC
Version-Release number of selected component:
el6.6
vt13.11

Comment 5 Oved Ourfali 2015-02-25 11:16:53 UTC
How reproducible is this?

Comment 6 Raz Tamir 2015-02-25 11:39:17 UTC
It reproduced when running a bunch of tests, around 100 test cases.

Comment 7 Liran Zelkha 2015-02-25 12:29:23 UTC
Please share a heap dump of the server from when the service was killed.

Comment 8 Gil Klein 2015-02-25 14:55:55 UTC
Liran, it seems the system did not take a heap dump when the service was killed (expected, since the kernel OOM killer sends SIGKILL, which the JVM cannot catch, so -XX:+HeapDumpOnOutOfMemoryError never fires):

[root@jenkins-vm-33 ~]# ls -l /var/log/ovirt-engine/dump/
total 0

Comment 9 Liran Zelkha 2015-02-25 15:47:01 UTC
Can you take snapshots regularly so we'll have one from before the crash?

Comment 11 Raz Tamir 2015-02-26 07:36:58 UTC
Hi Liran,
Are you talking about a snapshot of the VM that hosts the engine? If so, why do you need it?

Comment 12 Liran Zelkha 2015-02-26 08:39:02 UTC
No, I'm talking about taking snapshots of the java process. It can be done with the jmap tool. Gil/Eldad are doing it regularly.
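
For reference, a heap snapshot of the engine JVM can be taken with jmap roughly like this (a minimal sketch; the PID lookup and output path are illustrative, and jmap generally must run as the same user as the target JVM):

# PID=$(pgrep -f jboss-modules.jar | head -1)   # illustrative way to find the engine JVM
# su ovirt -s /bin/sh -c "jmap -dump:live,format=b,file=/var/log/ovirt-engine/dump/engine-$PID.hprof $PID"

A lighter alternative is a class histogram (jmap -histo $PID), which is cheap enough to collect periodically.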

Comment 13 Jiri Belka 2015-02-26 08:48:52 UTC
IMHO the BZ title should be changed - there is nothing wrong with the OOM killer terminating a process so that the kernel survives.

The BZ title should probably be: ovirt-engine is using a lot of memory...
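
For reference, the score the kernel uses to pick a victim can be inspected per process (standard /proc interface; PID 16014 is the engine JVM from the log above):

# cat /proc/16014/oom_score        # badness score consulted by the OOM killer
# cat /proc/16014/oom_score_adj    # bias; -1000 exempts a process entirely

Note that in the process dump above the postmaster processes run with oom_score_adj -1000, which is why the kernel kept selecting the java and python processes instead.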

Comment 15 Liran Zelkha 2015-02-26 09:33:47 UTC
Gil - I can't access the server. Can you provide the full URL?

Comment 17 Liran Zelkha 2015-03-09 07:06:03 UTC
Gil, it seems the file is in an incorrect format. Can you retake it?

Comment 19 Liran Zelkha 2015-03-10 07:06:13 UTC
Hi Gil,

Only one of those dumps works (the 140MB one; the other is probably malformed).
It reflects memory usage of 600MB.
There is a large object, 70MB in size, that reflects data being sent to the client (parsed to XML), but that still doesn't explain the root cause.
How much memory was allocated to the engine? How much memory does the machine have?

Comment 20 Gil Klein 2015-03-11 09:49:22 UTC
(In reply to Liran Zelkha from comment #19)
> Hi Gil,
> 
> Only one of those dumps works (the 140MB one; the other is probably
> malformed).
> It reflects memory usage of 600MB.
> There is a large object, 70MB in size, that reflects data being sent
> to the client (parsed to XML), but that still doesn't explain the root cause.
> How much memory was allocated to the engine?
Seems to be the default 1GB:

ovirt     1594 24.9 23.0 2895816 902660 ?      Sl   11:37   2:45 ovirt-engine -server -XX:+TieredCompilation -Xms1g -Xmx1g -XX:PermSize=256m -XX:MaxPermSize=256m -Djava.net.preferIPv4Stack=true -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Djava.awt.headless=true -Djsse.enableSNIExtension=false -Djava.security.krb5.conf=/etc/ovirt-engine/krb5.conf -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/ovirt-engine/dump -Djava.util.logging.manager=org.jboss.logmanager -Dlogging.configuration=file:///var/lib/ovirt-engine/jboss_runtime/config/ovirt-engine-logging.properties -Dorg.jboss.resolver.warning=true -Djboss.modules.system.pkgs=org.jboss.byteman -Djboss.modules.write-indexes=false -Djboss.server.default.config=ovirt-engine -Djboss.home.dir=/usr/share/jbossas -Djboss.server.base.dir=/usr/share/ovirt-engine -Djboss.server.data.dir=/var/lib/ovirt-engine -Djboss.server.log.dir=/var/log/ovirt-engine -Djboss.server.config.dir=/var/lib/ovirt-engine/jboss_runtime/config -Djboss.server.temp.dir=/var/lib/ovirt-engine/jboss_runtime/tmp -Djboss.controller.temp.dir=/var/lib/ovirt-engine/jboss_runtime/tmp -jar /usr/share/jbossas/jboss-modules.jar -mp /var/lib/ovirt-engine/jboss_runtime/modules/00-ovirt-engine-modules:/var/lib/ovirt-engine/jboss_runtime/modules/01-ovirt-engine-extension-aaa-ldap-modules:/var/lib/ovirt-engine/jboss_runtime/modules/02-jbossas-modules -jaxpmodule javax.xml.jaxp-provider org.jboss.as.standalone -c ovirt-engine.xml

> How much memory does the machine have?
# free -m
             total       used       free     shared    buffers     cached
Mem:          3829       1853       1975         16         24        417
-/+ buffers/cache:       1411       2417
Swap:         2047          0       2047

Comment 21 Liran Zelkha 2015-03-11 13:09:03 UTC
So probably not enough. I know that on Yuri's environment we're using at least 2GB. Can you increase the max heap size?
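
For reference, the heap limits can be raised via a drop-in config rather than by editing service scripts (a sketch assuming the ENGINE_HEAP_MIN/ENGINE_HEAP_MAX variables honored by the ovirt-engine service; the file name is illustrative):

# cat > /etc/ovirt-engine/engine.conf.d/90-heap.conf <<'EOF'
ENGINE_HEAP_MIN=2g
ENGINE_HEAP_MAX=2g
EOF
# service ovirt-engine restart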

Comment 22 Gil Klein 2015-03-11 13:27:00 UTC
(In reply to Liran Zelkha from comment #21)
> So probably not enough. I know that on Yuri's environment we're using at
> least 2GB. Can you increase max heap size?
I can do it, but currently all customers are automatically configured with 1GB by default.

We didn't see any problems when running this environment with 3.4, and this environment is not doing any scale testing (like Yuri does). I have good reason to suspect that this is a regression introduced in 3.5.

Can you point out what this large object is and why it is growing so large?

Comment 23 Liran Zelkha 2015-03-16 19:22:21 UTC
It's not a single object. I see VMs, snapshots, files, capabilities.
But since it's 600MB and your heap is 1GB, I'm not sure it's related.
Can you take several snapshots (say, one every 30 minutes) so I can see the overall memory behavior?

Comment 25 Liran Zelkha 2015-03-19 18:33:58 UTC
I looked at all the files from March 1st. All take ~220MB of RAM. I can't see any reason why this number would grow.

Could it be related to a specific test?

Comment 26 Gil Klein 2015-03-25 16:08:41 UTC
Might be.

Raz, did you see this issue again in your environments?

Comment 27 Carlos Mestre González 2015-03-25 16:50:20 UTC
AFAIK we haven't seen it (Raz?), but we've increased the max heap size:
> ovirt-engine -server -XX:+TieredCompilation -Xms2g -Xmx4g -XX:PermSize=256m -XX:MaxPermSize=256m...

So we can either:
1) set the value back to the default and see if it fails, or
2) keep the max/min heap size we have now and take snapshots every 30 minutes, as suggested before. Can that work?

Comment 28 Raz Tamir 2015-03-25 18:10:01 UTC
Yes, Carlos is right, we haven't seen this again since the heap size increase.

Comment 29 Liran Zelkha 2015-03-26 06:22:17 UTC
Let's do both: decrease the memory size, and set up a cron job that takes a snapshot every 30 minutes. Can that work?
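
A periodic collection could look like this (a minimal sketch; the PID lookup and paths are illustrative, and a class histogram is used instead of a full dump to keep the files small):

# cat > /etc/cron.d/engine-heap-histo <<'EOF'
*/30 * * * * ovirt jmap -histo $(pgrep -f jboss-modules.jar | head -1) > /var/log/ovirt-engine/dump/histo-$(date +\%s).txt 2>&1
EOF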

