Bug 1336664
| Summary: | Ceilometer crashes on Controller with "kernel: Out of memory" while OSP HA env runs 5000 guests | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Yuri Obshansky <yobshans> |
| Component: | openstack-ceilometer | Assignee: | Mehdi ABAAKOUK <mabaakou> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Sasha Smolyak <ssmolyak> |
| Severity: | urgent | Docs Contact: | |
| Priority: | medium | | |
| Version: | 8.0 (Liberty) | CC: | akrzos, fbaudin, jdanjou, jruzicka, mabaakou, mschuppe, nlevinki, pkilambi, sclewis, skinjo, srevivo, yobshans |
| Target Milestone: | zstream | Keywords: | Triaged, ZStream |
| Target Release: | 12.0 (Pike) | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-03-28 09:07:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Yuri Obshansky
2016-05-17 07:36:27 UTC
Created attachment 1158224 [details]
split message log file, part 1
Created attachment 1158225 [details]
split message log file, part 2
Created attachment 1158226 [details]
split message log file, part 3
Created attachment 1158227 [details]
split message log file, part 4
Created attachment 1158228 [details]
split message log file, part 5
Created attachment 1158229 [details]
split message log file, part 6
Created attachment 1158230 [details]
pcs status output
*** Bug 1364193 has been marked as a duplicate of this bug. ***

Since OSP 8, this can be fixed by configuration; you can set:

[oslo_messaging_rabbit]
rabbit_qos_prefetch_count = <something reasonable>

Starting from OSP 11, this value will be set automatically to a correct default. Yuri Obshansky, can you test the proposed configuration?

This can be fixed by configuration in OSP 8, 9 and 10 by setting in ceilometer.conf:

[oslo_messaging_rabbit]
rabbit_qos_prefetch_count = <something reasonable, e.g. 100>

This is no longer necessary starting from OSP 11.

Hi Mehdi, sorry for the delay. I'm still on vacation due to the relocation and I still don't have a suitable environment. I'll come back on November 1 and start reproducing. I'm not removing the "Need info" flag. Yuri

Hi fellows, I'm still not able to reproduce it due to a hardware problem. We are moving the bare-metal servers to another location, so it will take some time. But I remember this bug and it is high on the priority list. Sorry for the delay. Yuri

Still no platform to retest this?

Hi, I've received the hardware and started to deploy OpenStack, but failed on the network configuration (VLANs). So I opened tickets and am waiting. I hope it will be resolved soon so I can start reproducing the bugs.
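The workaround discussed in this thread amounts to a one-line change in ceilometer.conf. A minimal sketch of the resulting fragment (the value 100 is only the example suggested above, not a tuned recommendation):

```ini
# /etc/ceilometer/ceilometer.conf -- workaround sketch for OSP 8-10.
# 100 is the example value suggested in this thread; tune for your load.
[oslo_messaging_rabbit]
rabbit_qos_prefetch_count = 100
```

After editing the file, the ceilometer services consuming from RabbitMQ need a restart for the new prefetch limit to take effect.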
Yuri

Hi, the bug didn't reproduce on RHOS 8 (rhos-release 8 -p 2017-04-07.2).

[root@overcloud-controller-0 log]# rpm -qa | grep ceilometer
openstack-ceilometer-polling-5.0.5-1.el7ost.noarch
openstack-ceilometer-notification-5.0.5-1.el7ost.noarch
python-ceilometerclient-1.5.2-1.el7ost.noarch
openstack-ceilometer-collector-5.0.5-1.el7ost.noarch
python-ceilometer-5.0.5-1.el7ost.noarch
openstack-ceilometer-common-5.0.5-1.el7ost.noarch
openstack-ceilometer-api-5.0.5-1.el7ost.noarch
openstack-ceilometer-central-5.0.5-1.el7ost.noarch
openstack-ceilometer-compute-5.0.5-1.el7ost.noarch
openstack-ceilometer-alarm-5.0.5-1.el7ost.noarch

Only one warning was found in the messages log file:

May 1 23:46:47 overcloud-controller-2 kernel: ------------[ cut here ]------------
May 1 23:46:47 overcloud-controller-2 kernel: WARNING: at fs/xfs/xfs_aops.c:1244 xfs_vm_releasepage+0xcb/0x100 [xfs]()
May 1 23:46:47 overcloud-controller-2 kernel: Modules linked in: ip_set_hash_net ip_set nfnetlink ip6table_raw xt_comment xt_CHECKSUM vport_vxlan vxlan ip6_udp_tunnel udp_tunnel iptable_raw br_netfilter bridge stp llc iptable_mangle iptable_nat loop target_core_mod binfmt_misc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter nls_utf8 isofs openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw iTCO_wdt gf128mul glue_helper ablk_helper cryptd iTCO_vendor_support sg dcdbas pcspkr mei_me shpchp ipmi_devintf sb_edac edac_core mei lpc_ich acpi_power_meter acpi_pad ipmi_si ipmi_msghandler nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables
May 1 23:46:47 overcloud-controller-2 kernel: xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect crct10dif_pclmul crct10dif_common sysimgblt crc32c_intel fb_sys_fops ttm ahci libahci ixgbe drm igb libata dca i2c_algo_bit mdio i2c_core ptp megaraid_sas ntb pps_core wmi fjes
May 1 23:46:47 overcloud-controller-2 kernel: CPU: 16 PID: 145 Comm: kswapd0 Not tainted 3.10.0-514.2.2.el7.x86_64 #1
May 1 23:46:47 overcloud-controller-2 kernel: Hardware name: Dell Inc. PowerEdge R620/0KCKR5, BIOS 2.5.4 01/22/2016
May 1 23:46:47 overcloud-controller-2 kernel: 0000000000000000 000000003d19dd54 ffff88081bb63aa0 ffffffff816860cc
May 1 23:46:47 overcloud-controller-2 kernel: ffff88081bb63ad8 ffffffff81085940 ffffea00024d23e0 ffffea00024d23c0
May 1 23:46:47 overcloud-controller-2 kernel: ffff880afb4749f8 ffff88081bb63da0 ffffea00024d23c0 ffff88081bb63ae8
May 1 23:46:47 overcloud-controller-2 kernel: Call Trace:
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffff816860cc>] dump_stack+0x19/0x1b
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81085940>] warn_slowpath_common+0x70/0xb0
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81085a8a>] warn_slowpath_null+0x1a/0x20
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffffa036856b>] xfs_vm_releasepage+0xcb/0x100 [xfs]
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffff811805c2>] try_to_release_page+0x32/0x50
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81196546>] shrink_active_list+0x3d6/0x3e0
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81196941>] shrink_lruvec+0x3f1/0x770
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81196d36>] shrink_zone+0x76/0x1a0
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81197fdc>] balance_pgdat+0x48c/0x5e0
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffff811982a3>] kswapd+0x173/0x450
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81198130>] ? balance_pgdat+0x5e0/0x5e0
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffff810b064f>] kthread+0xcf/0xe0
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffff81696618>] ret_from_fork+0x58/0x90
May 1 23:46:47 overcloud-controller-2 kernel: [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
May 1 23:46:47 overcloud-controller-2 kernel: ---[ end trace fd6c9fbe55c6416d ]---

I don't think it is related to this bug. I cannot wait 48 hours due to the limited time for hardware usage; I have to go ahead. I'll try to reproduce it with RHOS 9 next weekend.

Attached files: pcs-status-2017-05-02.txt, openstack-status-2017-05-02.txt, openstack-status-overcloud-2017-05-02.txt, nova-list-all-tenants-2017-05-02.txt

Created attachment 1275714 [details]
pcs-status-2017-05-02.txt
Created attachment 1275715 [details]
openstack-status-2017-05-02.txt
Created attachment 1275716 [details]
openstack-status-overcloud-2017-05-02.txt
Created attachment 1275717 [details]
nova-list-all-tenants-2017-05-02.txt
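The verification runs in this thread boil down to soaking the environment and scanning the controller's messages log for OOM-killer activity. A minimal sketch of that check (the log path and message patterns are assumptions based on the RHEL 7 kernel logs quoted in this report):

```shell
# check_oom: minimal sketch -- report kernel OOM-killer events in a log file.
# Patterns are the usual RHEL 7 kernel wording ("Out of memory:",
# "invoked oom-killer"); adjust if your kernel logs differ.
check_oom() {
  grep -E 'Out of memory:|invoked oom-killer' "$1"
}
```

Example usage during a soak test: `check_oom /var/log/messages` (a non-empty result means the OOM killer fired on that node).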
Hi, the bug didn't reproduce on RHOS 9 (rhos-release 9 -p 2017-04-07.5).

[stack@c02-h02-r620 ~]$ rpm -qa | grep ceilometer
openstack-ceilometer-notification-6.1.3-2.el7ost.noarch
openstack-ceilometer-central-6.1.3-2.el7ost.noarch
openstack-ceilometer-common-6.1.3-2.el7ost.noarch
openstack-ceilometer-polling-6.1.3-2.el7ost.noarch
openstack-ceilometer-api-6.1.3-2.el7ost.noarch
openstack-ceilometer-collector-6.1.3-2.el7ost.noarch
python-ceilometer-6.1.3-2.el7ost.noarch
python-ceilometer-tests-6.1.3-2.el7ost.noarch
python-ceilometerclient-2.3.0-1.el7ost.noarch

I left the environment idle with 4925 instances running over the weekend (more than 60 hours) and the crash didn't happen. See the attached files: pcs-status-2017-05-08.txt, openstack-status-2017-05-08.txt, openstack-status-overcloud-2017-05-08.txt.

I'll try to reproduce it on RHOS 10 to make sure there is no regression. Thanks.

Created attachment 1277103 [details]
pcs-status-2017-05-08.txt
Created attachment 1277104 [details]
openstack-status-2017-05-08.txt
Created attachment 1277105 [details]
openstack-status-overcloud-2017-05-08.txt
As the collector will be removed in OSP 12, this should not happen anymore anyway. Tagging for that release.