Bug 903432

Summary: QEMU process is OOM-killed despite 11 GB of free memory; cgroup limit needs to be relaxed
Product: Fedora
Reporter: William Brown <william>
Component: libvirt
Assignee: Libvirt Maintainers <libvirt-maint>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 18
CC: amit.shah, berrange, cfergeau, clalancette, crobinso, dwmw2, itamar, jforbes, jyang, laine, libvirt-maint, pbonzini, rjones, scottt.tw, veillard, virt-maint
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2013-03-04 12:16:51 EST
Type: Bug

Description William Brown 2013-01-23 19:54:37 EST
Description of problem:
Jan 24 11:17:02 franky. kernel: qemu-kvm invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Jan 24 11:17:02 franky. kernel: qemu-kvm cpuset=emulator mems_allowed=0
Jan 24 11:17:02 franky. kernel: Pid: 13888, comm: qemu-kvm Tainted: GF       W    3.7.2-204.fc18.x86_64 #1
Jan 24 11:17:02 franky. kernel: Call Trace:
Jan 24 11:17:02 franky. kernel:  [<ffffffff810d2a81>] ? cpuset_print_task_mems_allowed+0x91/0xa0
Jan 24 11:17:02 franky. kernel:  [<ffffffff8162c8b4>] dump_header+0x79/0x1b3
Jan 24 11:17:02 franky. kernel:  [<ffffffff811330e7>] ? find_lock_task_mm+0x27/0x70
Jan 24 11:17:02 franky. kernel:  [<ffffffff8118af70>] ? try_get_mem_cgroup_from_mm+0x50/0x70
Jan 24 11:17:02 franky. kernel:  [<ffffffff812ec783>] ? ___ratelimit+0xa3/0x120
Jan 24 11:17:02 franky. kernel:  [<ffffffff811334b7>] oom_kill_process+0x1c7/0x310
Jan 24 11:17:02 franky. kernel:  [<ffffffff8106acd5>] ? has_ns_capability_noaudit+0x15/0x20
Jan 24 11:17:02 franky. kernel:  [<ffffffff8118dcdf>] mem_cgroup_out_of_memory+0x29f/0x2c0
Jan 24 11:17:02 franky. kernel:  [<ffffffff8118e473>] __mem_cgroup_try_charge+0x773/0x8d0
Jan 24 11:17:02 franky. kernel:  [<ffffffff814c265a>] ? dm_get_live_table+0x4a/0x60
Jan 24 11:17:02 franky. kernel:  [<ffffffff8118d6c0>] ? mem_cgroup_write+0x3b0/0x3b0
Jan 24 11:17:02 franky. kernel:  [<ffffffff8118ed9e>] mem_cgroup_charge_common+0x8e/0x120
Jan 24 11:17:02 franky. kernel:  [<ffffffff8118f86a>] mem_cgroup_cache_charge+0x7a/0xa0
Jan 24 11:17:02 franky. kernel:  [<ffffffff81130447>] add_to_page_cache_locked+0x67/0x1a0
Jan 24 11:17:02 franky. kernel:  [<ffffffff8113059a>] add_to_page_cache_lru+0x1a/0x40
Jan 24 11:17:02 franky. kernel:  [<ffffffff812fb7e1>] ? list_del+0x11/0x40
Jan 24 11:17:02 franky. kernel:  [<ffffffff811d0845>] mpage_readpages+0xa5/0x140
Jan 24 11:17:02 franky. kernel:  [<ffffffff81219b80>] ? noalloc_get_block_write+0x30/0x30
Jan 24 11:17:02 franky. kernel:  [<ffffffff81219b80>] ? noalloc_get_block_write+0x30/0x30
Jan 24 11:17:02 franky. kernel:  [<ffffffff8112f7c7>] ? __page_cache_alloc+0xd7/0x100
Jan 24 11:17:02 franky. kernel:  [<ffffffff812158fd>] ext4_readpages+0x1d/0x20
Jan 24 11:17:02 franky. kernel:  [<ffffffff8113c0e7>] __do_page_cache_readahead+0x1b7/0x240
Jan 24 11:17:02 franky. kernel:  [<ffffffff8113c491>] ra_submit+0x21/0x30
Jan 24 11:17:02 franky. kernel:  [<ffffffff8113c5b5>] ondemand_readahead+0x115/0x240
Jan 24 11:17:02 franky. kernel:  [<ffffffff8113c7b3>] page_cache_sync_readahead+0x33/0x50
Jan 24 11:17:02 franky. kernel:  [<ffffffff81131a10>] generic_file_aio_read+0x540/0x6c0
Jan 24 11:17:02 franky. kernel:  [<ffffffff81194b27>] do_sync_read+0xa7/0xe0
Jan 24 11:17:02 franky. kernel:  [<ffffffff81195279>] vfs_read+0xa9/0x180
Jan 24 11:17:02 franky. kernel:  [<ffffffff8119552a>] sys_pread64+0x9a/0xb0
Jan 24 11:17:02 franky. kernel:  [<ffffffff8163ded9>] system_call_fastpath+0x16/0x1b
Jan 24 11:17:02 franky. kernel: Task in /libvirt/qemu/Server2008r2 killed as a result of limit of /libvirt/qemu/Server2008r2
Jan 24 11:17:02 franky. kernel: memory: usage 4549836kB, limit 4549836kB, failcnt 565605
Jan 24 11:17:02 franky. kernel: memory+swap: usage 0kB, limit 9007199254740991kB, failcnt 0
Jan 24 11:17:02 franky. kernel: Mem-Info:
Jan 24 11:17:02 franky. kernel: Node 0 DMA per-cpu:
Jan 24 11:17:02 franky. kernel: CPU    0: hi:    0, btch:   1 usd:   0
Jan 24 11:17:02 franky. kernel: CPU    1: hi:    0, btch:   1 usd:   0
Jan 24 11:17:02 franky. kernel: CPU    2: hi:    0, btch:   1 usd:   0
Jan 24 11:17:02 franky. kernel: CPU    3: hi:    0, btch:   1 usd:   0
Jan 24 11:17:02 franky. kernel: CPU    4: hi:    0, btch:   1 usd:   0
Jan 24 11:17:02 franky. kernel: CPU    5: hi:    0, btch:   1 usd:   0
Jan 24 11:17:02 franky. kernel: CPU    6: hi:    0, btch:   1 usd:   0
Jan 24 11:17:02 franky. kernel: CPU    7: hi:    0, btch:   1 usd:   0
Jan 24 11:17:02 franky. kernel: Node 0 DMA32 per-cpu:
Jan 24 11:17:02 franky. kernel: CPU    0: hi:  186, btch:  31 usd: 195
Jan 24 11:17:02 franky. kernel: CPU    1: hi:  186, btch:  31 usd: 192
Jan 24 11:17:02 franky. kernel: CPU    2: hi:  186, btch:  31 usd:   0
Jan 24 11:17:02 franky. kernel: CPU    3: hi:  186, btch:  31 usd:  18
Jan 24 11:17:02 franky. kernel: CPU    4: hi:  186, btch:  31 usd:  30
Jan 24 11:17:02 franky. kernel: CPU    5: hi:  186, btch:  31 usd: 176
Jan 24 11:17:02 franky. kernel: CPU    6: hi:  186, btch:  31 usd:   0
Jan 24 11:17:02 franky. kernel: CPU    7: hi:  186, btch:  31 usd:  30
Jan 24 11:17:02 franky. kernel: Node 0 Normal per-cpu:
Jan 24 11:17:02 franky. kernel: CPU    0: hi:  186, btch:  31 usd:  71
Jan 24 11:17:02 franky. kernel: CPU    1: hi:  186, btch:  31 usd:  91
Jan 24 11:17:02 franky. kernel: CPU    2: hi:  186, btch:  31 usd: 146
Jan 24 11:17:02 franky. kernel: CPU    3: hi:  186, btch:  31 usd: 130
Jan 24 11:17:02 franky. kernel: CPU    4: hi:  186, btch:  31 usd: 110
Jan 24 11:17:02 franky. kernel: CPU    5: hi:  186, btch:  31 usd: 130
Jan 24 11:17:02 franky. kernel: CPU    6: hi:  186, btch:  31 usd:  49
Jan 24 11:17:02 franky. kernel: CPU    7: hi:  186, btch:  31 usd:  52
Jan 24 11:17:02 franky. kernel: [291B blob data]
Jan 24 11:17:02 franky. kernel: Node 0 DMA free:15888kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):...nreclaimable? no
Jan 24 11:17:02 franky. kernel: lowmem_reserve[]: 0 2139 15997 15997
Jan 24 11:17:02 franky. kernel: Node 0 DMA32 free:2175560kB min:9028kB low:11284kB high:13540kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB iso...nreclaimable? no
Jan 24 11:17:02 franky. kernel: lowmem_reserve[]: 0 0 13858 13858
Jan 24 11:17:02 franky. kernel: Node 0 Normal free:4780740kB min:58488kB low:73108kB high:87732kB active_anon:5275760kB inactive_anon:206408kB active_file:1650712kB inactive_file:1709500kB unevictable:4...nreclaimable? no
Jan 24 11:17:02 franky. kernel: lowmem_reserve[]: 0 0 0 0
Jan 24 11:17:02 franky. kernel: Node 0 DMA: 2*4kB 1*8kB 2*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15888kB
Jan 24 11:17:02 franky. kernel: Node 0 DMA32: 14*4kB 10*8kB 10*16kB 9*32kB 16*64kB 8*128kB 12*256kB 6*512kB 4*1024kB 6*2048kB 525*4096kB = 2175560kB
Jan 24 11:17:02 franky. kernel: Node 0 Normal: 368*4kB 591*8kB 283*16kB 157*32kB 421*64kB 601*128kB 237*256kB 162*512kB 177*1024kB 23*2048kB 1047*4096kB = 4780104kB
Jan 24 11:17:02 franky. kernel: 862236 total pagecache pages
Jan 24 11:17:02 franky. kernel: 7631 pages in swap cache
Jan 24 11:17:02 franky. kernel: Swap cache stats: add 18439, delete 10808, find 3479/3479
Jan 24 11:17:02 franky. kernel: Free swap  = 8145088kB
Jan 24 11:17:02 franky. kernel: Total swap = 8175612kB
Jan 24 11:17:02 franky. kernel: 4193776 pages RAM
Jan 24 11:17:02 franky. kernel: 107207 pages reserved
Jan 24 11:17:02 franky. kernel: 2680404 pages shared
Jan 24 11:17:02 franky. kernel: 1862358 pages non-shared
Jan 24 11:17:02 franky. kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Jan 24 11:17:02 franky. kernel: [13623]   107 13623  1275536  1059312    2236     6839             0 qemu-kvm
Jan 24 11:17:02 franky. kernel: Memory cgroup out of memory: Kill process 13993 (qemu-kvm) score 335 or sacrifice child
Jan 24 11:17:02 franky. kernel: Killed process 13993 (qemu-kvm) total-vm:5102144kB, anon-rss:4226008kB, file-rss:11240kB


Free reports:

             total       used       free     shared    buffers     cached
Mem:      16346276    5134592   11211684          0     276768    3143840
-/+ buffers/cache:    1713984   14632292
Swap:      8175612          0    8175612


The VM in question was allocated 4096 MB of RAM.

Where is this cgroup limit coming from?
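
The limit the kernel enforced above comes from the per-domain memory cgroup that libvirt creates at guest startup. On a host of this era it can be inspected directly (a sketch; the cgroup path matches the one shown in the OOM log, but the mount point and domain name may differ on other hosts):

```shell
# Hard limit libvirt set for this domain, in bytes; path relative to the
# memory cgroup mount point, as seen in the OOM log above.
cat /sys/fs/cgroup/memory/libvirt/qemu/Server2008r2/memory.limit_in_bytes

# The same limits as reported by libvirt itself (values in KiB).
virsh memtune Server2008r2
```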
Comment 1 William Brown 2013-01-23 20:15:06 EST
It appears that setting disk cache mode to "None" corrects this issue.
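For reference, the per-disk cache mode is set on the <driver> element of the domain XML, editable with "virsh edit <domain>" (a minimal sketch; the image path, format, and target device are illustrative, not taken from this report):

```xml
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/var/lib/libvirt/images/Server2008r2.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>
```

With cache='none' the guest's disk I/O bypasses the host page cache, which avoids the cgroup charging those cache pages to the qemu process, as seen in the add_to_page_cache_lru frames in the trace above.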
Comment 2 Richard W.M. Jones 2013-01-24 03:35:24 EST
There was a bug in libvirt (fixed upstream) where it set
the cgroup memory limit too low causing qemu to die under
load.  I believe the fix for this is:

commit 3c83df679e8feab939c08b1f97c48f9290a5b8cd
Author: Michal Privoznik <mprivozn@redhat.com>
Date:   Tue Jan 8 10:15:49 2013 +0100

    qemu: Relax hard RSS limit
    
    Currently, if there's no hard memory limit defined for a domain,
    libvirt tries to calculate one from the domain definition using a
    magic equation, and sets it at domain startup. The rationale was
    that if there is a memory leak or exploit in qemu, the host system
    should be protected from thrashing. However, the equation was too
    tight, as it didn't reflect what the kernel counts towards the
    memory used by a process. Since many hosts do have swap, nobody
    noticed: once the hard memory limit is reached, the process can
    continue allocating memory in swap. However, if there is no swap
    on the host, the process gets killed by the OOM killer. In our
    case, that is the qemu process.
    
    To prevent this, we need to relax the hard RSS limit. Moreover, it
    should reflect more precisely how the kernel accounts memory to a
    process: even kernel caches are counted against a process (within
    cgroups, at least). Hence the magic equation has to be changed:
    
      limit = 1.5 * (domain memory + total video memory) + (32MB for cache
              per each disk) + 200MB
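
As a rough sanity check, the new equation can be evaluated for this report's guest (a minimal sketch; the 16 MiB video RAM and single-disk figures are assumptions, not taken from the report):

```python
def hard_limit_kib(mem_kib, video_kib, num_disks):
    """Relaxed hard RSS limit per the equation above:
    1.5 * (domain memory + video memory) + 32 MiB per disk + 200 MiB.
    All values in KiB."""
    return int(1.5 * (mem_kib + video_kib)) + num_disks * 32 * 1024 + 200 * 1024

# This bug's guest: 4096 MiB RAM, assumed 16 MiB video RAM and one disk.
limit = hard_limit_kib(4096 * 1024, 16 * 1024, 1)
```

Under those assumptions the new limit works out to 6553600 kB (6400 MB), comfortably above the 4549836 kB limit this guest hit in the log above.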

(By the way -- the description of this bug is wrong.  It *does*
affect you even if you have swap).


There's also bug 891324 in libvirtd (actually, in libselinux)
where a memory leak causes libvirtd to grow and grow in size
until there's no memory left to run guests; however that's
probably not the bug you are hitting.
Comment 3 Cole Robinson 2013-01-24 17:55:34 EST
Thanks Rich, that sounds like the culprit.
Comment 4 Fedora Update System 2013-01-28 15:46:42 EST
libvirt-0.10.2.3-1.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/libvirt-0.10.2.3-1.fc18
Comment 5 Fedora Update System 2013-02-05 11:53:49 EST
libvirt-0.10.2.3-1.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 6 Richard W.M. Jones 2013-05-24 06:41:23 EDT
Possibly this bug has started to happen again (bug 966939).