Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1336664

Summary: Ceilometer crashes on Controller with "kernel: Out of memory" while OSP HA env runs 5000 guests
Product: Red Hat OpenStack
Component: openstack-ceilometer
Version: 8.0 (Liberty)
Hardware: x86_64
OS: Linux
Status: CLOSED NEXTRELEASE
Severity: urgent
Priority: medium
Keywords: Triaged, ZStream
Target Milestone: zstream
Target Release: 12.0 (Pike)
Reporter: Yuri Obshansky <yobshans>
Assignee: Mehdi ABAAKOUK <mabaakou>
QA Contact: Sasha Smolyak <ssmolyak>
CC: akrzos, fbaudin, jdanjou, jruzicka, mabaakou, mschuppe, nlevinki, pkilambi, sclewis, skinjo, srevivo, yobshans
Type: Bug
Last Closed: 2018-03-28 09:07:55 UTC
Attachments:
splited message log file part1
splited message log file part2
splited message log file part3
splited message log file part4
splited message log file part5
splited message log file part6
pcs status output
pcs-status-2017-05-02.txt
openstack-status-2017-05-02.txt
openstack-status-overcloud-2017-05-02.txt
nova-list-all-tenants-2017-05-02.txt
pcs-status-2017-05-08.txt
openstack-status-2017-05-08.txt
openstack-status-overcloud-2017-05-08.txt

Description Yuri Obshansky 2016-05-17 07:36:27 UTC
Description of problem:
The problem arose when I created 5000 instances on an OSP-8 HA bare-metal environment and left it running idle over the weekend.
When I came back after the weekend, I found that many services had crashed and OSP was not fully functional; 1 of the 3 controllers was dead.
There are many errors "kernel: Out of memory: Kill process 21948 (ceilometer-coll)"
in the /var/log/messages log file.
May 13 12:10:19 overcloud-controller-1 kernel: clustercheck invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
May 13 12:10:19 overcloud-controller-1 kernel: clustercheck cpuset=/ mems_allowed=0-1
May 13 12:10:19 overcloud-controller-1 kernel: CPU: 7 PID: 25874 Comm: clustercheck Not tainted 3.10.0-327.13.1.el7.x86_64 #1
May 13 12:10:19 overcloud-controller-1 kernel: Hardware name: Dell Inc. PowerEdge R620/0KCKR5, BIOS 1.6.0 03/07/2013
May 13 12:10:19 overcloud-controller-1 kernel: ffff88099d9f7300 000000002fa9a725 ffff88014760ba68 ffffffff816356f4
May 13 12:10:19 overcloud-controller-1 kernel: ffff88014760baf8 ffffffff8163068f ffff88086e069ad0 ffff88086e069ae8
May 13 12:10:19 overcloud-controller-1 kernel: ffffffff00000206 fffeebff00000000 0000000000000001 ffffffff81128903
May 13 12:10:19 overcloud-controller-1 kernel: Call Trace:
May 13 12:10:19 overcloud-controller-1 kernel: [<ffffffff816356f4>] dump_stack+0x19/0x1b
May 13 12:10:19 overcloud-controller-1 kernel: [<ffffffff8163068f>] dump_header+0x8e/0x214
May 13 12:10:19 overcloud-controller-1 kernel: [<ffffffff81128903>] ? delayacct_end+0x53/0xb0
May 13 12:10:19 overcloud-controller-1 kernel: [<ffffffff8116ce7e>] oom_kill_process+0x24e/0x3b0
May 13 12:10:19 overcloud-controller-1 kernel: [<ffffffff81088d8e>] ? has_capability_noaudit+0x1e/0x30
May 13 12:10:19 overcloud-controller-1 kernel: [<ffffffff8116d6a6>] out_of_memory+0x4b6/0x4f0
May 13 12:10:19 overcloud-controller-1 kernel: [<ffffffff81173885>] __alloc_pages_nodemask+0xa95/0xb90
May 13 12:10:19 overcloud-controller-1 kernel: [<ffffffff811b792a>] alloc_pages_vma+0x9a/0x140
May 13 12:10:19 overcloud-controller-1 kernel: [<ffffffff81192e5b>] __do_fault+0x33b/0x510
May 13 12:10:19 overcloud-controller-1 kernel: [<ffffffff811970f8>] handle_mm_fault+0x5b8/0xf50
May 13 12:10:19 overcloud-controller-1 kernel: [<ffffffff8119e705>] ? do_mmap_pgoff+0x305/0x3c0
May 13 12:10:19 overcloud-controller-1 kernel: [<ffffffff81641380>] __do_page_fault+0x150/0x450
May 13 12:10:19 overcloud-controller-1 kernel: [<ffffffff816416a3>] do_page_fault+0x23/0x80
May 13 12:10:19 overcloud-controller-1 kernel: [<ffffffff8163d908>] page_fault+0x28/0x30
May 13 12:10:19 overcloud-controller-1 kernel: Mem-Info:
May 13 12:10:19 overcloud-controller-1 kernel: Node 0 DMA per-cpu:
May 13 12:10:19 overcloud-controller-1 kernel: CPU    0: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    1: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    2: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    3: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    4: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    5: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    6: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    7: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    8: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    9: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   10: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   11: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   12: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   13: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   14: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   15: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   16: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   17: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   18: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   19: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   20: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   21: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   22: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   23: hi:    0, btch:   1 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: Node 0 DMA32 per-cpu:
May 13 12:10:19 overcloud-controller-1 kernel: CPU    0: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    1: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    2: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    3: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    4: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    5: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    6: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    7: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    8: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    9: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   10: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   11: hi:  186, btch:  31 usd:  30
May 13 12:10:19 overcloud-controller-1 kernel: CPU   12: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   13: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   14: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   15: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   16: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   17: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   18: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   19: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   20: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   21: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   22: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   23: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: Node 0 Normal per-cpu:
May 13 12:10:19 overcloud-controller-1 kernel: CPU    0: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    1: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    2: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    3: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    4: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    5: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    6: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    7: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    8: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    9: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   10: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   11: hi:  186, btch:  31 usd:  30
May 13 12:10:19 overcloud-controller-1 kernel: CPU   12: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   13: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   14: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   15: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   16: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   17: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   18: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   19: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   20: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   21: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   22: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   23: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: Node 1 Normal per-cpu:
May 13 12:10:19 overcloud-controller-1 kernel: CPU    0: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    1: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    2: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    3: hi:  186, btch:  31 usd:   1
May 13 12:10:19 overcloud-controller-1 kernel: CPU    4: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    5: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    6: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    7: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU    8: hi:  186, btch:  31 usd:   6
May 13 12:10:19 overcloud-controller-1 kernel: CPU    9: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   10: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   11: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   12: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   13: hi:  186, btch:  31 usd:   3
May 13 12:10:19 overcloud-controller-1 kernel: CPU   14: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   15: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   16: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   17: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   18: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   19: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   20: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   21: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   22: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: CPU   23: hi:  186, btch:  31 usd:   0
May 13 12:10:19 overcloud-controller-1 kernel: active_anon:15656326 inactive_anon:9609 isolated_anon:0 active_file:0 inactive_file:0 isolated_file:92 unevictable:29568 dirty:0 writeback:0 unstable:0 free:55005 slab_reclaimable:150821 slab_unreclaimable:107680 mapped:12226 shmem:10085 pagetables:189816 bounce:0 free_cma:0
May 13 12:10:19 overcloud-controller-1 kernel: Node 0 DMA free:14872kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15980kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
May 13 12:10:19 overcloud-controller-1 kernel: lowmem_reserve[]: 0 2767 31949 31949
May 13 12:10:19 overcloud-controller-1 kernel: Node 0 DMA32 free:120556kB min:3880kB low:4848kB high:5820kB active_anon:2470764kB inactive_anon:3256kB active_file:0kB inactive_file:0kB unevictable:300kB isolated(anon):0kB isolated(file):0kB present:3083200kB managed:2835760kB mlocked:300kB dirty:0kB writeback:0kB mapped:2704kB shmem:2624kB slab_reclaimable:134412kB slab_unreclaimable:28004kB kernel_stack:1248kB pagetables:31672kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
May 13 12:10:19 overcloud-controller-1 kernel: lowmem_reserve[]: 0 0 29181 29181
May 13 12:10:19 overcloud-controller-1 kernel: Node 0 Normal free:40720kB min:40948kB low:51184kB high:61420kB active_anon:28583100kB inactive_anon:33804kB active_file:32kB inactive_file:0kB unevictable:14084kB isolated(anon):0kB isolated(file):0kB present:30408704kB managed:29881960kB mlocked:14084kB dirty:0kB writeback:0kB mapped:40488kB shmem:37064kB slab_reclaimable:245496kB slab_unreclaimable:207976kB kernel_stack:10304kB pagetables:297376kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:23 all_unreclaimable? no
May 13 12:10:19 overcloud-controller-1 kernel: lowmem_reserve[]: 0 0 0 0
May 13 12:10:19 overcloud-controller-1 kernel: Node 1 Normal free:43872kB min:45256kB low:56568kB high:67884kB active_anon:31571440kB inactive_anon:1376kB active_file:28kB inactive_file:972kB unevictable:103888kB isolated(anon):0kB isolated(file):368kB present:33554432kB managed:33027448kB mlocked:103888kB dirty:0kB writeback:0kB mapped:5712kB shmem:652kB slab_reclaimable:223376kB slab_unreclaimable:194740kB kernel_stack:8960kB pagetables:430216kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:55 all_unreclaimable? no
May 13 12:10:19 overcloud-controller-1 kernel: lowmem_reserve[]: 0 0 0 0
May 13 12:10:19 overcloud-controller-1 kernel: Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 1*2048kB (R) 3*4096kB (M) = 14872kB
May 13 12:10:19 overcloud-controller-1 kernel: Node 0 DMA32: 1188*4kB (UE) 1567*8kB (UEM) 848*16kB (UEM) 285*32kB (UEM) 938*64kB (UEM) 77*128kB (UEM) 14*256kB (UM) 9*512kB (M) 3*1024kB (UM) 0*2048kB 0*4096kB = 121128kB
May 13 12:10:19 overcloud-controller-1 kernel: Node 0 Normal: 1092*4kB (UEM) 924*8kB (UEM) 1681*16kB (UEM) 163*32kB (EM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43872kB
May 13 12:10:19 overcloud-controller-1 kernel: Node 1 Normal: 733*4kB (UEM) 606*8kB (UEM) 2131*16kB (UEM) 102*32kB (UEM) 3*64kB (UE) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 45332kB
May 13 12:10:19 overcloud-controller-1 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
May 13 12:10:19 overcloud-controller-1 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 13 12:10:19 overcloud-controller-1 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
May 13 12:10:19 overcloud-controller-1 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 13 12:10:19 overcloud-controller-1 kernel: 12772 total pagecache pages
May 13 12:10:19 overcloud-controller-1 kernel: 0 pages in swap cache
May 13 12:10:19 overcloud-controller-1 kernel: Swap cache stats: add 0, delete 0, find 0/0
May 13 12:10:19 overcloud-controller-1 kernel: Free swap  = 0kB
May 13 12:10:19 overcloud-controller-1 kernel: Total swap = 0kB
May 13 12:10:19 overcloud-controller-1 kernel: 16765579 pages RAM
May 13 12:10:19 overcloud-controller-1 kernel: 0 pages HighMem/MovableOnly
....

May 13 12:10:19 overcloud-controller-1 kernel: Out of memory: Kill process 21948 (ceilometer-coll) score 534 or sacrifice child
May 13 12:10:19 overcloud-controller-1 kernel: Killed process 21948 (ceilometer-coll) total-vm:35342404kB, anon-rss:35070816kB, file-rss:2532kB
May 13 12:10:19 overcloud-controller-1 ceilometer-alarm-evaluator: 2016-05-13 12:10:19.467 22625 WARNING oslo.service.loopingcall [-] Function 'ceilometer.coordination.PartitionCoordinator.heartbeat' run outlasted interval by 0.13 sec
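
One simple way to watch the leak developing before the OOM kill is to sample the collector's resident set size over time. A minimal sketch (the comm name "ceilometer-coll" is the kernel-truncated process name, matching the OOM messages above):

# Print the collector's PID, RSS (in kB), and comm name once a minute.
while sleep 60; do
    date
    ps -C ceilometer-coll -o pid=,rss=,comm=
done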

Version-Release number of selected component (if applicable):
rhos-release 8 -p 2016-04-18.1

How reproducible:
HA environment: 3 controller nodes and 6 compute nodes, each with 24 CPUs and 64 GB RAM.
Created 5000 instances:
- image: cirros (12.6 MB)
- flavor: m1.nano (1 VCPU, 64 MB RAM, 1 GB disk)
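
For illustration, one way to create a batch of this size (a minimal sketch, not the exact commands used in the test; it assumes the cirros image and m1.nano flavor are already registered and overcloud credentials are sourced):

# Boot 5000 guests in 50 batches of 100 using the Liberty-era nova CLI.
# --min-count/--max-count schedule a whole batch in one API call;
# nova appends an index to each instance name in the batch.
for i in $(seq 1 50); do
    nova boot --image cirros --flavor m1.nano \
         --min-count 100 --max-count 100 "idle-guest-$i"
    sleep 30   # give the schedulers time to drain between batches
done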

Steps to Reproduce:
1. Deploy an OSP-8 HA environment (3 controllers, 6 computes).
2. Create 5000 cirros instances with the m1.nano flavor.
3. Leave the environment idle for roughly 48 hours and watch /var/log/messages on the controllers.

Actual results:
ceilometer-collector grows until the kernel OOM killer terminates it ("Out of memory: Kill process ... (ceilometer-coll)"); one of the three controllers ends up dead.

Expected results:
Memory usage of the ceilometer services stays bounded regardless of the number of guests.

Additional info:

Comment 2 Yuri Obshansky 2016-05-17 09:17:26 UTC
Created attachment 1158224 [details]
splited message log file part1

Comment 3 Yuri Obshansky 2016-05-17 09:18:29 UTC
Created attachment 1158225 [details]
splited message log file part2

Comment 4 Yuri Obshansky 2016-05-17 09:19:26 UTC
Created attachment 1158226 [details]
splited message log file part3

Comment 5 Yuri Obshansky 2016-05-17 09:20:25 UTC
Created attachment 1158227 [details]
splited message log file part4

Comment 6 Yuri Obshansky 2016-05-17 09:21:26 UTC
Created attachment 1158228 [details]
splited message log file part5

Comment 7 Yuri Obshansky 2016-05-17 09:22:20 UTC
Created attachment 1158229 [details]
splited message log file part6

Comment 8 Yuri Obshansky 2016-05-17 09:22:49 UTC
Created attachment 1158230 [details]
pcs status output

Comment 9 Julien Danjou 2016-09-02 14:51:45 UTC
*** Bug 1364193 has been marked as a duplicate of this bug. ***

Comment 10 Mehdi ABAAKOUK 2016-10-11 16:27:25 UTC
Since OSP 8, this can be fixed by configuration; you can set:

[oslo_messaging_rabbit]
rabbit_qos_prefetch_count = <something reasonable>


Starting from OSP 11, this value will be set automatically to a correct default.

Comment 11 Mehdi ABAAKOUK 2016-10-12 11:36:49 UTC
Yuri Obshansky, can you test the proposed configuration?

Comment 12 Mehdi ABAAKOUK 2016-10-24 12:51:28 UTC
This can be fixed by configuration in OSP 8, 9, and 10 by setting, in ceilometer.conf:

[oslo_messaging_rabbit]
rabbit_qos_prefetch_count = <something reasonable, like 100>

This is no longer necessary since OSP 10.
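
For reference, one way to apply this on a controller (a sketch; the config path and service name are assumptions based on a typical OSP overcloud layout, and on an HA deployment the collector may instead be managed by Pacemaker):

# Set the prefetch limit and restart the collector so it takes effect.
sudo crudini --set /etc/ceilometer/ceilometer.conf \
     oslo_messaging_rabbit rabbit_qos_prefetch_count 100
sudo systemctl restart openstack-ceilometer-collector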

Comment 13 Yuri Obshansky 2016-10-26 10:26:16 UTC
Hi Mehdi,
Sorry for the delay. I'm still on vacation due to the relocation
and don't yet have a suitable environment.
I'll be back on November 1 and will start reproducing.
I'm leaving the "Need info" flag in place.
Yuri

Comment 14 Yuri Obshansky 2016-12-06 17:26:25 UTC
Hi fellows,
I'm still not able to reproduce this due to a hardware problem.
We are moving the BM servers to another location,
so it will take some time.
But I remember this bug, and it is on my high-priority list.
Sorry for the delay.
Yuri

Comment 15 Mehdi ABAAKOUK 2017-01-23 10:34:39 UTC
Still no platform to retest this?

Comment 16 Yuri Obshansky 2017-01-24 15:39:30 UTC
Hi,
I've received the hardware and started deploying OpenStack,
but failed on network configuration (VLANs).
I've opened tickets and am waiting for them to be resolved.
I hope that happens soon so I can start reproducing the bugs.
Yuri

Comment 17 Yuri Obshansky 2017-05-02 15:49:11 UTC
Hi, 
The bug did not reproduce on RHOS 8 (rhos-release 8 -p 2017-04-07.2)
[root@overcloud-controller-0 log]# rpm -qa | grep ceilometer
openstack-ceilometer-polling-5.0.5-1.el7ost.noarch
openstack-ceilometer-notification-5.0.5-1.el7ost.noarch
python-ceilometerclient-1.5.2-1.el7ost.noarch
openstack-ceilometer-collector-5.0.5-1.el7ost.noarch
python-ceilometer-5.0.5-1.el7ost.noarch
openstack-ceilometer-common-5.0.5-1.el7ost.noarch
openstack-ceilometer-api-5.0.5-1.el7ost.noarch
openstack-ceilometer-central-5.0.5-1.el7ost.noarch
openstack-ceilometer-compute-5.0.5-1.el7ost.noarch
openstack-ceilometer-alarm-5.0.5-1.el7ost.noarch

Only one warning was found in the messages log file:
May  1 23:46:47 overcloud-controller-2 kernel: ------------[ cut here ]------------
May  1 23:46:47 overcloud-controller-2 kernel: WARNING: at fs/xfs/xfs_aops.c:1244 xfs_vm_releasepage+0xcb/0x100 [xfs]()
May  1 23:46:47 overcloud-controller-2 kernel: Modules linked in: ip_set_hash_net ip_set nfnetlink ip6table_raw xt_comment xt_CHECKSUM vport_vxlan vxlan ip6_udp_tunnel udp_tunnel iptable_raw br_netfilter bridge stp llc iptable_mangle iptable_nat loop target_core_mod binfmt_misc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter nls_utf8 isofs openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw iTCO_wdt gf128mul glue_helper ablk_helper cryptd iTCO_vendor_support sg dcdbas pcspkr mei_me shpchp ipmi_devintf sb_edac edac_core mei lpc_ich acpi_power_meter acpi_pad ipmi_si ipmi_msghandler nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables
May  1 23:46:47 overcloud-controller-2 kernel: xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect crct10dif_pclmul crct10dif_common sysimgblt crc32c_intel fb_sys_fops ttm ahci libahci ixgbe drm igb libata dca i2c_algo_bit mdio i2c_core ptp megaraid_sas ntb pps_core wmi fjes
May  1 23:46:47 overcloud-controller-2 kernel: CPU: 16 PID: 145 Comm: kswapd0 Not tainted 3.10.0-514.2.2.el7.x86_64 #1
May  1 23:46:47 overcloud-controller-2 kernel: Hardware name: Dell Inc. PowerEdge R620/0KCKR5, BIOS 2.5.4 01/22/2016
May  1 23:46:47 overcloud-controller-2 kernel: 0000000000000000 000000003d19dd54 ffff88081bb63aa0 ffffffff816860cc
May  1 23:46:47 overcloud-controller-2 kernel: ffff88081bb63ad8 ffffffff81085940 ffffea00024d23e0 ffffea00024d23c0
May  1 23:46:47 overcloud-controller-2 kernel: ffff880afb4749f8 ffff88081bb63da0 ffffea00024d23c0 ffff88081bb63ae8
May  1 23:46:47 overcloud-controller-2 kernel: Call Trace:
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffff816860cc>] dump_stack+0x19/0x1b
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81085940>] warn_slowpath_common+0x70/0xb0
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81085a8a>] warn_slowpath_null+0x1a/0x20
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffffa036856b>] xfs_vm_releasepage+0xcb/0x100 [xfs]
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffff811805c2>] try_to_release_page+0x32/0x50
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81196546>] shrink_active_list+0x3d6/0x3e0
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81196941>] shrink_lruvec+0x3f1/0x770
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81196d36>] shrink_zone+0x76/0x1a0
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81197fdc>] balance_pgdat+0x48c/0x5e0
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffff811982a3>] kswapd+0x173/0x450
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81198130>] ? balance_pgdat+0x5e0/0x5e0
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffff810b064f>] kthread+0xcf/0xe0
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81696618>] ret_from_fork+0x58/0x90
May  1 23:46:47 overcloud-controller-2 kernel: [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
May  1 23:46:47 overcloud-controller-2 kernel: ---[ end trace fd6c9fbe55c6416d ]---

I don't think it is related to this bug.

I cannot wait 48 hours due to the limited time I have on this hardware;
I have to move on.
I'll try to reproduce it with RHOS 9 next weekend.

Attached files:
pcs-status-2017-05-02.txt
openstack-status-2017-05-02.txt
openstack-status-overcloud-2017-05-02.txt
nova-list-all-tenants-2017-05-02.txt

Comment 18 Yuri Obshansky 2017-05-02 15:50:10 UTC
Created attachment 1275714 [details]
pcs-status-2017-05-02.txt

Comment 19 Yuri Obshansky 2017-05-02 15:50:42 UTC
Created attachment 1275715 [details]
openstack-status-2017-05-02.txt

Comment 20 Yuri Obshansky 2017-05-02 15:51:10 UTC
Created attachment 1275716 [details]
openstack-status-overcloud-2017-05-02.txt

Comment 21 Yuri Obshansky 2017-05-02 15:51:38 UTC
Created attachment 1275717 [details]
nova-list-all-tenants-2017-05-02.txt

Comment 22 Yuri Obshansky 2017-05-08 13:33:16 UTC
Hi, 
The bug did not reproduce on RHOS 9 (rhos-release 9 -p 2017-04-07.5)
[stack@c02-h02-r620 ~]$ rpm -qa | grep ceilometer
openstack-ceilometer-notification-6.1.3-2.el7ost.noarch
openstack-ceilometer-central-6.1.3-2.el7ost.noarch
openstack-ceilometer-common-6.1.3-2.el7ost.noarch
openstack-ceilometer-polling-6.1.3-2.el7ost.noarch
openstack-ceilometer-api-6.1.3-2.el7ost.noarch
openstack-ceilometer-collector-6.1.3-2.el7ost.noarch
python-ceilometer-6.1.3-2.el7ost.noarch
python-ceilometer-tests-6.1.3-2.el7ost.noarch
python-ceilometerclient-2.3.0-1.el7ost.noarch

I left the environment idle with 4925 instances running over the weekend
(more than 60 hours) and the crash did not happen.
See attached files: 
pcs-status-2017-05-08.txt
openstack-status-2017-05-08.txt
openstack-status-overcloud-2017-05-08.txt

I'll try to reproduce it on RHOS 10 to make sure there is no regression.
Thanks

Comment 23 Yuri Obshansky 2017-05-08 13:34:07 UTC
Created attachment 1277103 [details]
pcs-status-2017-05-08.txt

Comment 24 Yuri Obshansky 2017-05-08 13:34:41 UTC
Created attachment 1277104 [details]
openstack-status-2017-05-08.txt

Comment 25 Yuri Obshansky 2017-05-08 13:35:21 UTC
Created attachment 1277105 [details]
openstack-status-overcloud-2017-05-08.txt

Comment 26 Julien Danjou 2017-09-18 06:59:19 UTC
As the collector will be removed in OSP12, this should not happen anymore anyway. Tagging for that release.