Bug 1236057
Summary: | Overcloud Ceilometer stops working when turning off one controller in HA setup | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea> | ||||||
Component: | openstack-puppet-modules | Assignee: | Jiri Stransky <jstransk> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Omri Hochman <ohochman> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | high | ||||||||
Version: | 7.0 (Kilo) | CC: | aortega, ddomingo, eglynn, gchamoul, gfidente, hbrock, jschluet, lbezdick, mburns, mcornea, morazi, ohochman, rhel-osp-director-maint, rrosa, yeylon | ||||||
Target Milestone: | ga | ||||||||
Target Release: | 7.0 (Kilo) | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | openstack-puppet-modules-2015.1.8-8.el7ost | Doc Type: | Bug Fix | ||||||
Doc Text: |
Previously, the HAProxy configuration of the Telemetry service used incorrect checks, which caused the Telemetry service to fail in an HA deployment. Specifically, the HAProxy configuration did not have availability checks, and incorrectly used SSL checks instead of TCP.
This release fixes the checks, ensuring that the Telemetry service is correctly balanced and can launch in an HA deployment.
|
Story Points: | --- | ||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2015-08-05 13:28:11 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 1228433 | ||||||||
Attachments: |
|
I haven't collected yet enough infos on the root cause, but I have seen this happening on a live env Dev can't seem to reproduce this issue. Can you reproduce consistently? Is there an environment we can look at where this reproduces? The change at https://code.engineering.redhat.com/gerrit/#/c/52379/ which was merged to fix BZ 1236374 removes the constraint which was restarting Ceilometer when Redis VIP relocated. This bug is most probably affected by that change and we need to see if it is reproduced using openstack-tripleo-heat-templates-0.8.6-26.el7ost I'm still able to reproduce it. This is what I get after turning off one of the first controller: [stack@instack ~]$ ceilometer meter-list ('Connection aborted.', BadStatusLine("''",)) I'm seeing the ceilometer backends in haproxy don't have checks: listen ceilometer bind 172.16.20.10:8777 bind 172.16.23.10:8777 balance roundrobin option tcplog option ssl-hello-chk server overcloud-controller-0 172.16.20.16:8777 server overcloud-controller-1 172.16.20.14:8777 server overcloud-controller-2 172.16.20.15:8777 Created attachment 1051530 [details]
ceilometer logs
Attaching ceilometer logs from one of the 2 remaining running controller nodes.
Marius, the initial query will fail until ceilometer recognizes the first mongodb node is down, logging something like: Unable to reconnect to the primary mongodb: [Errno 110] Connection timed out. Trying again in 10 seconds. at which points it will attempt a connection to the second mongodb instance and if you the same request it will work as expected. The haproxy misses the check option for ceilometer backends. We removed the node that is down from haproxy config and timeouts didn't occur anymore. Upstream patch: https://review.openstack.org/#/c/199507/2 Marius, ack, some requests were indeed failing on the "check" keyword missing With current puddle ceilometer api doesn't work right after deployment. Removing the ssl-hello-chk option makes it work again. [stack@instack ~]$ . overcloudrc [stack@instack ~]$ ceilometer meter-list ('Connection aborted.', BadStatusLine("''",)) [root@overcloud-controller-0 heat-admin]# grep -A8 ceilometer /etc/haproxy/haproxy.cfg listen ceilometer bind 172.16.20.11:8777 bind 172.16.23.10:8777 balance roundrobin option tcplog option ssl-hello-chk server overcloud-controller-0 172.16.20.14:8777 check fall 5 inter 2000 rise 2 server overcloud-controller-1 172.16.20.15:8777 check fall 5 inter 2000 rise 2 server overcloud-controller-2 172.16.20.13:8777 check fall 5 inter 2000 rise 2 [root@overcloud-controller-0 ~]# tail -5 /var/log/ceilometer/api.log 2015-07-21 04:02:24.415 17058 ERROR werkzeug [-] 172.16.20.15 - - [21/Jul/2015 04:02:24] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x90HAPROXYSSLCHK') 2015-07-21 04:02:25.624 17058 ERROR werkzeug [-] 172.16.20.14 - - [21/Jul/2015 04:02:25] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x91HAPROXYSSLCHK') 2015-07-21 04:02:26.419 17058 ERROR werkzeug [-] 172.16.20.13 - - [21/Jul/2015 04:02:26] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x92HAPROXYSSLCHK') 2015-07-21 04:02:26.420 17058 ERROR werkzeug [-] 172.16.20.15 - - [21/Jul/2015 04:02:26] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x92HAPROXYSSLCHK') 2015-07-21 04:02:27.626 17058 ERROR werkzeug [-] 172.16.20.14 - - [21/Jul/2015 04:02:27] code 400, message Bad request syntax ('\x16\x03\x00\x00y\x01\x00\x00u\x03\x00U\xad\xfc\x93HAPROXYSSLCHK') Blarrgh -- this is caused by the haproxy.cfg change that we considered backing out of core. This would be a good reason to actually back it out, given we're respinning anyway. I think James Slagle's patch should fix this: https://review.openstack.org/203298/ Tested upstream: listen ceilometer bind 192.0.2.17:8777 bind 192.0.2.18:8777 server overcloud-controller-0 192.0.2.5:8777 check fall 5 inter 2000 rise 2 server overcloud-controller-1 192.0.2.4:8777 check fall 5 inter 2000 rise 2 server overcloud-controller-2 192.0.2.3:8777 check fall 5 inter 2000 rise 2 listen glance_api bind 192.0.2.17:9292 bind 192.0.2.18:9292 option httpchk GET / server overcloud-controller-0 192.0.2.5:9292 check fall 5 inter 2000 rise 2 server overcloud-controller-1 192.0.2.4:9292 check fall 5 inter 2000 rise 2 server overcloud-controller-2 192.0.2.3:9292 check fall 5 inter 2000 rise 2 We're waiting for CI results before merge. Actually i should have pasted the glance_registry bit: listen glance_registry bind 192.0.2.18:9191 server overcloud-controller-0 192.0.2.5:9191 check fall 5 inter 2000 rise 2 server overcloud-controller-1 192.0.2.4:9191 check fall 5 inter 2000 rise 2 server overcloud-controller-2 192.0.2.3:9191 check fall 5 inter 2000 rise 2 Patch merged upstream. Note : In order to see this test pass, it require to configure fence-agent on all of the HA controllers. Verified : openstack-tripleo-puppet-elements-0.0.1-4.el7ost.noarch openstack-puppet-modules-2015.1.8-8.el7ost.noarch puppet-3.6.2-2.el7.noarch instack-undercloud-2.1.2-22.el7ost.noarch instack-0.0.7-1.el7ost.noarch Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2015:1548 |
Created attachment 1043499 [details] ceilometer logs Description of problem: Overcloud Ceilometer stops working when turning off one controller in HA setup. I'm using a virt env with 3 x controllers, 2 x computes, 3 x ceph nodes. When turnning off first controller ceilometer stops working and ceilometer related resources stop working. Version-Release number of selected component (if applicable): openstack-tripleo-image-elements-0.9.6-4.el7ost.noarch openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch openstack-tripleo-puppet-elements-0.0.1-2.el7ost.noarch openstack-tripleo-heat-templates-0.8.6-18.el7ost.noarch openstack-tripleo-common-0.0.1.dev6-0.git49b57eb.el7ost.noarch openstack-puppet-modules-2015.1.7-5.el7ost.noarch How reproducible: 100% Steps to Reproduce: 1. Deploy overcloud with 3 controllers 2. Turn off one of the controllers 3. Check if overcloud ceilometer is working Actual results: Ceilometer stops working and cluster resources related to ceilometer fail Expected results: Ceilometer will still work after bringing one of the controllers down. Additional info: Attaching ceilometer logs and pcs status.